|
|
| Data Analyst Nanodegree, Udacity vamshi.krishna.prime@gmail.com |
|
2. Load Data
5. Credits
from IPython.display import Image
Image("img/Metro Bikeshare.jpg")
Image description: image of Metro Bike bikeshare.
Investigation Overview:
This investigation focuses on the factors that influence bike rentals and on the reforms that could improve rentals, based on customer preferences and hidden trends.
Dataset Overview:
The bikeshare data covers 3 years of rentals and includes bike type, ride type, customer pass type, ride timeline and distance, along with geographical data. Other variables such as fare, distance_miles and fare type are feature engineered for deeper analysis. The data required wrangling and cleaning, which were completed in ACT 1 of the process. The data is stored in a relational database, split into individual tables as is usual for large datasets.
Import libraries
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
import seaborn as sb
from sqlalchemy import create_engine
%matplotlib inline
from matplotlib.lines import Line2D
import matplotlib.patches as patches
# suppress warnings from final output
import warnings
warnings.simplefilter("ignore")
Load Data
Available formats (flat file and database):
| Dataset | Available format | Description | Mode of access |
|---|---|---|---|
| bikeshare_clean | bikeshare_master.csv | A clean dataset in CSV format | Load directly using the read_csv method in pandas |
| bikeshare_clean | bikeshare_master.db | A relational database | Requires an SQL query to gather data |
Tables in the database:
| Dataset | bike | time | fare | station |
|---|---|---|---|---|
| variable | trip_id | trip_id | trip_id | trip_id |
| variable | bike_id | start_time | trip_id | start_station_id |
| variable | trip_type | end_time | fare | start_lat |
| variable | bike_type | duration | fare_type | start_lon |
| variable | passholder_type | distance_miles | | end_station_id |
| variable | | | | end_lat |
| variable | | | | end_lon |
engine = create_engine('sqlite:///bikeshare_master.db')
# Import data from the database into a dataframe using SQL query
bikeshare = pd.read_sql('SELECT b.trip_id, \
b.bike_id, \
b.trip_type, \
b.bike_type, \
b.passholder_type AS pass_type, \
f.fare_type, \
t.start_time, \
t.end_time, \
t.duration AS duration_min, \
t.distance_miles, \
f.fare, \
s.start_station_id, \
s.start_lat, \
s.start_lon, \
s.end_station_id, \
s.end_lat, \
s.end_lon \
FROM bike AS b \
JOIN time AS t \
ON b.trip_id = t.trip_id \
JOIN fare AS f \
ON b.trip_id = f.trip_id \
JOIN station AS s \
ON t.trip_id = s.trip_id', engine)
Alternate approach is to load data from the flat file in CSV format.
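A minimal sketch of the flat-file route, assuming the CSV carries the same column names as the query result above. To keep the example self-contained, a two-row in-memory sample (invented values) stands in for bikeshare_master.csv; with the real file, its path would be passed to read_csv instead.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for bikeshare_master.csv
sample_csv = io.StringIO(
    "trip_id,trip_type,start_time,end_time\n"
    "1,One Way,2019-01-01 08:00:00,2019-01-01 08:15:00\n"
    "2,Round Trip,2019-01-01 09:30:00,2019-01-01 10:00:00\n"
)

# parse_dates converts the timestamp columns while the file is read
bikeshare_csv = pd.read_csv(sample_csv, parse_dates=["start_time", "end_time"])
print(bikeshare_csv.dtypes)
```

Categorical columns still need an explicit cast after loading, since a CSV file stores no dtype information.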
bikeshare.info()
Not all columns retain their datatype information when the dataset is retrieved from the database, because the data moves from one format/platform to another. The incorrect column datatypes have to be assigned manually.
level_order = ['One Way', 'Round Trip']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['trip_type'] = bikeshare['trip_type'].astype(ordered_cat)
level_order = ['unknown', 'Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['bike_type'] = bikeshare['bike_type'].astype(ordered_cat)
level_order = ['Walk-up', 'One Day', 'Monthly', 'Flex', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['pass_type'] = bikeshare['pass_type'].astype(ordered_cat)
level_order = ['Base', 'Extended']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
bikeshare['fare_type'] = bikeshare['fare_type'].astype(ordered_cat)
bikeshare['start_time'] = pd.to_datetime(bikeshare['start_time'])
bikeshare['end_time'] = pd.to_datetime(bikeshare['end_time'])
bikeshare.info()
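The repeated casts above follow one pattern, so they could also be driven from a dictionary that maps each column name to its level order. A sketch on a two-row toy frame (invented rows; toy and category_orders are illustrative names), showing two of the four columns:

```python
import pandas as pd

# Toy frame with invented rows; the level orders are the ones used above
toy = pd.DataFrame({
    "trip_type": ["Round Trip", "One Way"],
    "fare_type": ["Base", "Extended"],
})

category_orders = {
    "trip_type": ["One Way", "Round Trip"],
    "fare_type": ["Base", "Extended"],
}

# Cast each listed column to an ordered categorical dtype
for column, levels in category_orders.items():
    dtype = pd.api.types.CategoricalDtype(categories=levels, ordered=True)
    toy[column] = toy[column].astype(dtype)

print(toy.dtypes)
```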
Expand the dataset by extracting timeline variables for further plotting
The time-series components of each rental (hour, day, week, month, year) need to be extracted for further plotting.
%%time
# create timeline variables from the existing data
bikeshare['year'] = bikeshare['start_time'].dt.year
bikeshare['month'] = bikeshare['start_time'].dt.month
bikeshare['weekday'] = bikeshare['start_time'].dt.weekday
bikeshare['day'] = bikeshare['start_time'].dt.day
bikeshare['hour'] = bikeshare['start_time'].dt.hour
bikeshare[['year', 'month', 'weekday', 'day', 'hour']].head()
Extract daytime from the hour column:
Extract day_sections from the hour column.
# divide the hour of the day into customized sections
bins = [-1,5,11,16,20,23]
# use pd.cut to assign each hour to its bin; bins are right-closed, e.g. (-1, 5]
bikeshare['day_sections'] = pd.cut(bikeshare['start_time'].dt.hour, bins)
bikeshare['day_sections'].head(10)
Explore the various methods of extracting the section of the day from the hour of the day. To find the best-performing method (the one that takes the least time to extract the values), take the first 1000 entries in the dataset and measure the execution time of each.
%%capture --no-stdout
# Each label is matched against cat.categories, which lists the bins in sorted order.
# unique() returns values in order of first appearance, so its positions need not
# follow the order of the bins.
def apply_section(row):
    sections = df_new.day_sections.cat.categories
    if row in sections[0]:
        return 'Early hours'
    if row in sections[1]:
        return 'Morning'
    if row in sections[2]:
        return 'Afternoon'
    if row in sections[3]:
        return 'Evening'
    if row in sections[4]:
        return 'Night'
    return 'unknown'
def map_identity(row):
    sections = df_new.day_sections.cat.categories
    if row in sections[0]:
        return 'Early hours'
    if row in sections[1]:
        return 'Morning'
    if row in sections[2]:
        return 'Afternoon'
    if row in sections[3]:
        return 'Evening'
    if row in sections[4]:
        return 'Night'
    return 'unknown'
def map_identity2(row):
    sections = df_new.day_sections.cat.categories
    if row == sections[0]:
        return 'Early hours'
    if row == sections[1]:
        return 'Morning'
    if row == sections[2]:
        return 'Afternoon'
    if row == sections[3]:
        return 'Evening'
    if row == sections[4]:
        return 'Night'
    return 'unknown'
def mask_section(df):
    sections = df.day_sections.cat.categories
    # start from an object column so that string labels can be written,
    # and chain the masks so that earlier labels are not overwritten
    df['label4'] = df['day_sections'].astype(object)
    df['label4'] = df['label4'].mask(df.day_sections == sections[0], 'Early hours')
    df['label4'] = df['label4'].mask(df.day_sections == sections[1], 'Morning')
    df['label4'] = df['label4'].mask(df.day_sections == sections[2], 'Afternoon')
    df['label4'] = df['label4'].mask(df.day_sections == sections[3], 'Evening')
    df['label4'] = df['label4'].mask(df.day_sections == sections[4], 'Night')
def npwhere_section(df):
    sections = df.day_sections.cat.categories
    df['label5'] = np.where(df.day_sections == sections[0], 'Early hours', df.day_sections)
    df['label5'] = np.where(df.day_sections == sections[1], 'Morning', df.label5)
    df['label5'] = np.where(df.day_sections == sections[2], 'Afternoon', df.label5)
    df['label5'] = np.where(df.day_sections == sections[3], 'Evening', df.label5)
    df['label5'] = np.where(df.day_sections == sections[4], 'Night', df.label5)
def loc_section(df):
    sections = df.day_sections.cat.categories
    df.loc[df['day_sections'] == sections[0], 'label6'] = 'Early hours'
    df.loc[df['day_sections'] == sections[1], 'label6'] = 'Morning'
    df.loc[df['day_sections'] == sections[2], 'label6'] = 'Afternoon'
    df.loc[df['day_sections'] == sections[3], 'label6'] = 'Evening'
    df.loc[df['day_sections'] == sections[4], 'label6'] = 'Night'
# time each approach on the first 1000 rows
df_new = bikeshare.head(1000).copy()
%time df_new['label1'] = df_new['hour'].apply(lambda row: apply_section(row))
%time df_new['label2'] = df_new['hour'].map(map_identity)
%time df_new['label3'] = df_new['day_sections'].map(map_identity2)
%time mask_section(df_new)
%time npwhere_section(df_new)
%time loc_section(df_new)
From the timings above, it is evident that np.where, the map method and the .loc method (vectorized operations) perform best. On larger datasets, however, the .loc method performs better.
from IPython.display import Image
Image("img/performance chart.PNG", width = 600, height = 300)
It can be concluded from the above steps that the .loc method is the best solution for adding a new column by extracting/comparing values from an existing column.
Extract daytime from day_section.
%%time
def assign_daytime(df):
    # cat.categories lists the bins in sorted order (unique() follows order of appearance)
    sections = df.day_sections.cat.categories
    df.loc[df['day_sections'] == sections[0], 'daytime'] = 'Early hours'
    df.loc[df['day_sections'] == sections[1], 'daytime'] = 'Morning'
    df.loc[df['day_sections'] == sections[2], 'daytime'] = 'Afternoon'
    df.loc[df['day_sections'] == sections[3], 'daytime'] = 'Evening'
    df.loc[df['day_sections'] == sections[4], 'daytime'] = 'Night'
assign_daytime(bikeshare)
bikeshare.daytime.value_counts()
As expected, the .loc method performed well, extracting the daytime values from the day_sections column with 808589 entries in around 1 second.
# display a sample of 'daytime' entries for visual confirmation
bikeshare[['day_sections', 'daytime']].sample(10)
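As an aside, pd.cut can attach the labels itself through its labels argument, which avoids a second pass over the column. A minimal sketch on a handful of made-up hour values, using the same bin edges as above; it was not part of the timing comparison, so no performance claim is made for it here.

```python
import pandas as pd

# Made-up hour values covering both edges of every bin
hours = pd.Series([0, 5, 6, 11, 12, 16, 17, 20, 21, 23])

bins = [-1, 5, 11, 16, 20, 23]
labels = ["Early hours", "Morning", "Afternoon", "Evening", "Night"]

# Bins are right-closed: (-1, 5], (5, 11], (11, 16], (16, 20], (20, 23]
daytime = pd.cut(hours, bins=bins, labels=labels)
print(daytime.tolist())
```

The result is already an ordered categorical whose level order follows the labels list.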
Change weekday representation:
Change the weekday representation from numeric values to descriptive values. As mentioned above, use the .loc method to derive the new values from the existing column values.
| Integer Value | Day of the week |
|---|---|
| 0 | Monday |
| 1 | Tuesday |
| 2 | Wednesday |
| 3 | Thursday |
| 4 | Friday |
| 5 | Saturday |
| 6 | Sunday |
%%time
def assign_weekday(df):
    # cast to object first so that string labels can replace the integer codes
    df['weekday'] = df['weekday'].astype(object)
    df.loc[df['weekday'] == 0, 'weekday'] = 'Monday'
    df.loc[df['weekday'] == 1, 'weekday'] = 'Tuesday'
    df.loc[df['weekday'] == 2, 'weekday'] = 'Wednesday'
    df.loc[df['weekday'] == 3, 'weekday'] = 'Thursday'
    df.loc[df['weekday'] == 4, 'weekday'] = 'Friday'
    df.loc[df['weekday'] == 5, 'weekday'] = 'Saturday'
    df.loc[df['weekday'] == 6, 'weekday'] = 'Sunday'
assign_weekday(bikeshare)
# display a sample of 'weekday' entries for visual confirmation
bikeshare[['weekday']].sample(10)
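The integer-to-name lookup can also be written as a dictionary passed to map, or read from the datetime column with dt.day_name(). A small sketch on three invented dates (the variable names are illustrative):

```python
import pandas as pd

# Three invented dates: a Monday, a Saturday and a Sunday
start_time = pd.Series(pd.to_datetime(["2019-07-01", "2019-07-06", "2019-07-07"]))

weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# dt.weekday yields Monday = 0 ... Sunday = 6; map translates the codes to names
weekday_label = start_time.dt.weekday.map(dict(enumerate(weekday_names)))

# dt.day_name() returns the names directly from the timestamps
weekday_direct = start_time.dt.day_name()

print(weekday_label.tolist())
```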
Extract the relative number of the week in a month:
Each month spans four full seven-day sections plus, for days 29 to 31, a partial fifth section; only February in a non-leap year has no fifth section. Extract the relative number of the week in each month.
bins = [0,7,14,21,28,31]
# use pd.cut to assign each day of the month to its bin
bikeshare['week_sections'] = pd.cut(bikeshare['day'], bins)
bikeshare[['week_sections']].head()
bikeshare.week_sections.unique()
%%time
def assign_week(df):
    # cat.categories lists the bins in sorted order (unique() follows order of appearance)
    sections = df.week_sections.cat.categories
    df.loc[df['week_sections'] == sections[0], 'week'] = 'First'
    df.loc[df['week_sections'] == sections[1], 'week'] = 'Second'
    df.loc[df['week_sections'] == sections[2], 'week'] = 'Third'
    df.loc[df['week_sections'] == sections[3], 'week'] = 'Fourth'
    df.loc[df['week_sections'] == sections[4], 'week'] = 'Fifth'
assign_week(bikeshare)
bikeshare.week.value_counts()
bikeshare[['week_sections', 'week']].sample(10)
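Since every section spans seven days (days 1 to 7, 8 to 14, and so on), the same week label can also be derived with integer arithmetic rather than bins. A sketch on made-up day values, with illustrative names:

```python
import pandas as pd

# Made-up days of the month covering both edges of every section
day = pd.Series([1, 7, 8, 14, 15, 21, 22, 28, 29, 31])

# Days 1-7 -> 1, 8-14 -> 2, 15-21 -> 3, 22-28 -> 4, 29-31 -> 5
week_number = (day - 1) // 7 + 1

week_names = {1: "First", 2: "Second", 3: "Third", 4: "Fourth", 5: "Fifth"}
week = week_number.map(week_names)

print(week.tolist())
```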
Extract quarter of the year from the month column:
Extract quarter_sections from the month column.
# divide the months of the year into quarters
bins = [0,3,6,9,12]
# use pd.cut to assign each month to its bin
bikeshare['quarter_sections'] = pd.cut(bikeshare['start_time'].dt.month, bins)
bikeshare['quarter_sections'].sample(10)
Extract quarter from quarter_sections.
bikeshare.quarter_sections.unique()
%%time
def extract_quarter(df):
    # cat.categories lists the bins in sorted order (unique() follows order of appearance)
    sections = df.quarter_sections.cat.categories
    df.loc[df['quarter_sections'] == sections[0], 'quarter'] = 'Q1'
    df.loc[df['quarter_sections'] == sections[1], 'quarter'] = 'Q2'
    df.loc[df['quarter_sections'] == sections[2], 'quarter'] = 'Q3'
    df.loc[df['quarter_sections'] == sections[3], 'quarter'] = 'Q4'
extract_quarter(bikeshare)
bikeshare.quarter.value_counts()
As expected, the .loc method performed well, extracting the quarter of the year values from the quarter_sections column with 808589 entries in under 1 second.
# display a sample of 'quarter' entries for visual confirmation
bikeshare[['quarter_sections', 'quarter']].sample(10)
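pandas also exposes the quarter directly on datetime columns through dt.quarter, which gives the same Q1 to Q4 labels without intermediate bins. A sketch on four invented dates:

```python
import pandas as pd

# Four invented dates, one per quarter
start_time = pd.Series(pd.to_datetime(
    ["2019-01-15", "2019-04-01", "2019-09-30", "2019-12-31"]
))

# dt.quarter yields 1..4; prefix with 'Q' to match the labels used above
quarter = "Q" + start_time.dt.quarter.astype(str)

print(quarter.tolist())
```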
Change datatypes of multiple columns to ordered categorical dtype:
bikeshare.info()
df = bikeshare
level_order = ['Early hours', 'Morning', 'Afternoon', 'Evening', 'Night']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['daytime'] = df['daytime'].astype(ordered_cat)
level_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['weekday'] = df['weekday'].astype(ordered_cat)
level_order = ['First', 'Second', 'Third', 'Fourth', 'Fifth']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['week'] = df['week'].astype(ordered_cat)
level_order = ['Q1', 'Q2', 'Q3', 'Q4']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['quarter'] = df['quarter'].astype(ordered_cat)
bikeshare.info()
Remove redundant columns in the dataset:
cols_to_drop = ['day_sections', 'week_sections', 'quarter_sections']
bikeshare.drop(cols_to_drop, axis=1, inplace=True)
for i, col in enumerate(bikeshare.columns):
    print('{}'.format(i).ljust(2, " ") + ':' + '{}'.format(col))
Reorder columns in the dataset:
Reorder the columns so that the most relevant/numerical data sits leftmost for visual analysis.
reordered_columns = ['trip_id', 'bike_id', 'distance_miles', 'duration_min', 'fare',
'trip_type', 'bike_type', 'pass_type', 'fare_type', 'start_time',
'year', 'quarter', 'month', 'week', 'weekday', 'day', 'daytime','hour',
'end_time', 'start_station_id', 'start_lat', 'start_lon',
'end_station_id', 'end_lat', 'end_lon']
bikeshare = bikeshare.reindex(columns=reordered_columns)
for i, col in enumerate(bikeshare.columns):
    print('{}'.format(i).ljust(2, " ") + ':' + ' {}'.format(col))
# display current palette
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()
# set the palette to support 'colorblind'
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()
# visually confirm the palette change
current_palette = sb.color_palette()
sb.palplot(current_palette)
plt.show()
Explanatory Data Analysis
Column: trip_type
Data type: categorical data, nominal
Plot: Bar chart, Point plot, Facet grid
Aggregated distribution of bike rentals based on trip type:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
current_palette = sb.color_palette()
base_color = sb.color_palette()[0]
# prepare data for the plot
trip_type_order = bikeshare.trip_type.value_counts().index
max_count = bikeshare['trip_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]
# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'trip_type', color = base_color, alpha= 0.5, order = trip_type_order)
# improve plot aesthetics
plt.title('Aggregated distribution of bike rentals based on trip type\n', fontsize = 16, weight='bold')
plt.xlabel('\nTrip type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
trip_type_counts = bikeshare['trip_type'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = trip_type_counts[label.get_text()]
    except KeyError:
        count = 0
    pct_string = '{:0.0f}%'.format(100*count/n_points)
    # print the annotation above short bars and inside long bars
    if count < (n_points/10):
        plt.text(loc, count + (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
    else:
        plt.text(loc, count - (n_points/10), pct_string, ha = 'center', color = 'black', fontsize = 14);
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.1 Aggregated distribution of bike rentals based on trip type.png', dpi=300, bbox_inches='tight')
The above plot shows that customers prefer One Way trips to Round Trips for bike rental. However, the plot is based on the overall sum of bike rentals and does not reveal any trends/patterns that influence the trip type over time. Hence, let us calculate the average bike rentals over the hour of the day, categorized by trip type.
Average rentals based on the hour of the day over trip type:
# create a dataset for bike rentals over the hour of the day
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"],
bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
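Each row of hours_df is a rental count for one date, hour and trip type; the point plot that follows then averages those counts per hour (the mean is seaborn's default estimator for pointplot). The two steps can be sketched on a toy frame, where the trips and all names are invented for illustration:

```python
import pandas as pd

# Invented toy trips: four rentals on day 1 and two on day 2, all at 8:00
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5, 6],
    "day": [1, 1, 1, 1, 2, 2],
    "hour": [8, 8, 8, 8, 8, 8],
    "trip_type": ["One Way"] * 6,
})

# Step 1: count rentals per (day, hour, trip_type)
per_slot = (trips.groupby(["day", "hour", "trip_type"])["trip_id"]
                 .count()
                 .reset_index(name="rentals"))

# Step 2: average those counts per (hour, trip_type)
hourly_avg = per_slot.groupby(["hour", "trip_type"])["rentals"].mean()

print(hourly_avg)
```

Here the 8:00 slot has 4 rentals on the first day and 2 on the second, so its average is 3 rentals.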
Point plot:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ["#fff480"]
sb.set_palette(flatui, n_colors=1, desat=0.8)
base_color = sb.color_palette()[0]
# Seaborn's pointplot
ax = sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "--", hue = 'trip_type')
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Hourly average bike rentals categorized by trip type\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
hours_rental_avg_max = 70
y_tick_values = np.arange(0, hours_rental_avg_max+10, 10)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1,
framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# plot custom axial grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.2)
# add ellipse
# -------------------------------------------------------
ax.add_patch(
    patches.Ellipse(
        (2.5, 3),   # (x, y) of the centre
        6,          # width
        12,         # height
        angle=5,    # rotation in degrees, passed by keyword
        alpha=0.2, facecolor="grey", edgecolor="lightgrey", linewidth=1, linestyle='solid'
    )
)
# -------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.2 Hourly average bike rentals based on the trip type.png', dpi=300, bbox_inches='tight')
The above plot shows that the average number of One Way trips is higher than that of Round Trips for any hour between 6:00 AM and 12:00 AM. There is also a grey area where the average number of bike rentals is very low and not statistically significant for comparison.
The above plot is calculated over the sum of 3 years of data. Let us look at the individual years to check whether the same trend holds in each year.
Average rentals based on the hour of the day by trip type over years:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type',
               fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    for i, l in enumerate(labels):
        # keep every fifth label and blank out the rest
        if i % 5 != 0:
            labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.1, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.3 Distribution of bike rentals based on trip type.png', dpi=300, bbox_inches='tight')
Observation:
The dominance of One Way trips over Round Trips across the hours of the day persists in each individual year.
Let us take a look at the other factors that influence the bike trips over time:
Average bike rentals based on day of the week over years by trip type:
# create a dataset for bike rentals over each day in a week
weekday_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["week"],
bikeshare["weekday"],
bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
weekday_df.head(10)
Point plot:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Facet grid with point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = weekday_df, col = 'trip_type', height = 4, aspect = 1, hue = 'year')
g.map(sb.pointplot, "weekday", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on day of the week over years by trip type',
fontsize = 16, weight = 'bold')
g.set_titles('Trip = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, rotation = 30, size = 12)
    if i == 1:
        # Change the transparency of the lines in the second plot
        ax.lines[0].set_alpha(0.6)
        ax.lines[1].set_alpha(0.6)
        ax.lines[2].set_alpha(0.6)
        # one marker colour per line (the indices previously repeated line 0)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        ax.lines[2].set_markerfacecolor('#93d1fa')
        ax.set_facecolor('0.97')
    elif i == 0:
        # Change the transparency of the lines in the first plot
        ax.lines[0].set_alpha(0.8)
        ax.lines[1].set_alpha(0.8)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        # Add an inverted triangle marker at the desired data points
        ax.lines[2].set_markevery(every=[5,6])
        ax.lines[2].set_marker('v')
        ax.lines[2].set_markersize(10)
        ax.lines[2].set_markeredgewidth(2)
        ax.lines[2].set_markerfacecolor('orange')
        ax.lines[2].set_markeredgecolor('black')
# set the font size of the y tick labels
g.set_yticklabels(size = 12)
g.set_xlabels('\nDay of the week', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------
## add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.4, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.4 Average bike rentals based on day of the week over years by trip type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that the average number of One Way rentals drops suddenly over the weekend (Saturday and Sunday). This drop is especially large in 2019 compared to other years.
- Round Trips, in contrast, experience a slight increase in average bike rentals over the weekends.
Reforms:
- Care should be taken to increase the number of bike rentals at the end of the week. Organizing recreational events such as bike rallies could significantly increase bike rentals during holidays/weekends.
- Announcing discounts on One Way trips from stations with a high bike count to stations with a low bike count during the weekdays could even out the distribution of bikes across all stations.
Average bike rentals based on quarter of the year by trip type:
# create a dataset for bike rentals over each quarter in a year
quarter_df = bikeshare.groupby([bikeshare["year"],
bikeshare["quarter"],
bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
quarter_df.head(10)
Point plot:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Seaborn's point plot
plot_order = quarter_df.quarter.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = quarter_df, col = 'trip_type', col_wrap = 2, height = 4, aspect = 1, hue = 'year')
g.map(sb.pointplot, "quarter", "rentals", order= plot_order, linestyles = "-", ci = None);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on quarter over the years by trip type', fontsize = 16, weight = 'bold')
g.set_titles('{col_name}', weight = 'bold', size = 14, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        # tick text may use a Unicode minus sign, so normalise it before parsing
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    if i == 1:
        # Change the transparency of the lines in the second plot
        ax.lines[0].set_alpha(0.6)
        ax.lines[1].set_alpha(0.6)
        ax.lines[2].set_alpha(0.6)
        # one marker colour per line (the indices previously repeated line 0)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        ax.lines[2].set_markerfacecolor('#93d1fa')
        ax.set_facecolor('0.97')
    elif i == 0:
        # Change the transparency of the lines in the first plot
        ax.lines[0].set_alpha(0.8)
        ax.lines[1].set_alpha(0.8)
        ax.lines[0].set_markerfacecolor('#9dc0d1')
        ax.lines[1].set_markerfacecolor('#60859e')
        # Add an inverted triangle marker at the desired data point
        ax.lines[2].set_markevery(every=[1])
        ax.lines[2].set_marker('v')
        ax.lines[2].set_markersize(12)
        ax.lines[2].set_markeredgewidth(2)
        ax.lines[2].set_markerfacecolor('orange')
        ax.lines[2].set_markeredgecolor('black')
# sort the y_tick_names by numeric value (a plain string sort would put '100 K' before '20 K')
y_tick_names.sort(key=lambda name: float(name.split()[0]))
g.set_yticklabels(y_tick_names, size = 12)
g.set_xticklabels(plot_order, size = 12)
g.set_xlabels('\nQuarter of the year', size = 14)
g.set_ylabels('Avg. Bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.4, 1));
# -------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.5 Average bike rentals based on quarter over the years by trip type.png', dpi=300, bbox_inches='tight')
The above plot shows that 2019 has a relatively low number of average One Way rentals in the second quarter of the year. Let us take a deeper look at this insight.
Average bike rentals based on month of the year by trip type:
# create a dataset for bike rentals over each month in a year
month_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["trip_type"]]).count()['trip_id'].reset_index(name='rentals')
month_df.head(10)
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = month_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = month_df, col = 'year', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average bike rentals based on the month over years by trip type',
fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
y_tick_names = []
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        # tick text may use a Unicode minus sign, so normalise it before parsing
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    if i != 2:
        # Change the transparency of first line
        ax.lines[0].set_alpha(0.2)
        ax.lines[0].set_markerfacecolor('#ff9ec6')
        ax.set_facecolor('0.97')
    else:
        # Add an inverted triangle marker at the desired data points
        ax.lines[0].set_markevery(every=slice(0,5,1))
        ax.lines[0].set_marker('v')
        ax.lines[0].set_markersize(8)
        ax.lines[0].set_markeredgewidth(1)
        ax.lines[0].set_markerfacecolor('black')
        ax.lines[0].set_markeredgecolor('orange')
    if i != 0:
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5)
        sb.despine(left = True, ax=ax)
    # Change the transparency of second line
    ax.lines[1].set_alpha(0.2)
    ax.lines[1].set_markerfacecolor('#97f7e9')
    # set xlabels fontsize
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, size = 12)
# sort the y_tick_names by numeric value (a plain string sort would put '100 K' before '20 K')
y_tick_names.sort(key=lambda name: float(name.split()[0]))
g.set_yticklabels(y_tick_names, size = 12)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.1, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.0, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.6 Average bike rentals based on the month over the years by trip type.png', dpi=300, bbox_inches='tight')
Observation:
- The first half of 2019 sees relatively few One Way bike rentals compared to the second half of the year. This trend is not limited to 2019; it is consistent across the other years.
Reform:
- Promotions/discounts should be offered on One Way trips during the first half of the year to encourage customers to take more One Way trips.
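The first-half versus second-half gap described above can also be checked numerically. The following is a minimal sketch on made-up sample data; the column names mirror `month_df`, and it assumes that `month` holds month numbers (1 to 12), which the notebook does not confirm.

```python
import pandas as pd

# Hypothetical monthly counts (not the bikeshare data); columns mirror month_df
month_df = pd.DataFrame({
    "year":      [2019] * 4 + [2018] * 4,
    "month":     [2, 5, 8, 11, 2, 5, 8, 11],
    "trip_type": ["One Way"] * 8,
    "rentals":   [15000, 18000, 26000, 24000, 14000, 17000, 25000, 22000],
})

# Label each month as first half (Jan-Jun) or second half (Jul-Dec)
month_df["half"] = month_df["month"].apply(lambda m: "H1" if m <= 6 else "H2")

# Mean monthly One Way rentals per year and half of the year
one_way = month_df[month_df["trip_type"] == "One Way"]
half_means = one_way.groupby(["year", "half"])["rentals"].mean().unstack("half")

# A ratio below 1 means the first half lags behind the second half
half_means["h1_to_h2"] = half_means["H1"] / half_means["H2"]
print(half_means)
```

A ratio that stays below 1 for every year would support the claim that the pattern is not specific to 2019.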
Average bike rentals over each hour in a day by trip type and bike type:¶
# create a dataset for bike rentals over each hour in a day by trip type and bike type
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"],
bikeshare["trip_type"],
bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'bike_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and bike type',
fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    for loc,label in enumerate(labels):
        # skip labels
        if not (loc%5 == 0): labels[loc] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
    if i%3 != 0:
        sb.despine(left = True, ax=ax)
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5);
    if i in [0, 1, 2, 3, 6, 9, 10, 11]:
        # Change the transparency of first line
        ax.lines[0].set_alpha(0.4)
        ax.lines[0].set_markerfacecolor('#ff9ec6')
        ax.set_facecolor('0.97')
    if i in [4, 5, 7, 8]:
        ax.lines[0].set_marker('o')
        ax.lines[0].set_markersize(4)
        ax.lines[0].set_markerfacecolor('#e36297')
    # Change the transparency of second line
    ax.lines[1].set_alpha(0.4)
    ax.lines[1].set_markerfacecolor('#97f7e9')
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 3, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0, 5.5));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.7 Average bike rentals based on hour of the day over years by trip type and bike type.png', dpi=300, bbox_inches='tight')
The above plot shows that Standard bike rentals for One Way trips dropped significantly in 2019 compared to 2018. At the same time, there is a noticeable increase in Electric bike rentals for One Way trips. This suggests that customers preferred Electric bikes over Standard bikes for One Way trips in 2019. Let us take a closer look at the reason behind this trend.
Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type:¶
# create a dataset for bike rentals of standard and electric bikes over 2018 and 2019 by trip type
hours_df = hours_df.query(' (year == 2018 or year == 2019) and (bike_type == "Standard" or bike_type == "Electric")').copy()
level_order = ['Standard', 'Electric']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
hours_df['bike_type'] = hours_df['bike_type'].astype(ordered_cat)
hours_df
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = hours_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'bike_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type',
fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # set xlabels fontsize
    labels = ax.get_xticklabels()
    ax.set_xticklabels(labels, size = 12)
    if i%2 != 0:
        sb.despine(left=True, ax = ax)
        ax.axvline(0, ls='--', color='grey', linewidth=1, alpha=0.5);
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.6, 2.3));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.8 Average bike rentals of standard and electric bikes over 2018 and 2019 by trip type.png', dpi=300, bbox_inches='tight')
As the Electric bike was introduced towards the end of 2018, customers who used Standard bikes shifted towards Electric bikes after the second quarter of 2019. This indicates that the customers' bike preference changed, while the total number of One Way bike rentals did not decrease (see plot 3.1.3).
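The two claims above (a shift between bike types, but no drop in the total) can be separated with a small pivot. This is a minimal sketch on made-up yearly totals, not the bikeshare data; the column names are assumed to match the notebook.

```python
import pandas as pd

# Hypothetical yearly One Way totals per bike type (illustrative numbers only)
yearly = pd.DataFrame({
    "year":      [2018, 2018, 2019, 2019],
    "bike_type": ["Standard", "Electric", "Standard", "Electric"],
    "trip_type": ["One Way"] * 4,
    "rentals":   [200000, 10000, 120000, 90000],
})

one_way = yearly[yearly["trip_type"] == "One Way"]

# One row per bike type, one column per year
pivot = one_way.pivot_table(index="bike_type", columns="year",
                            values="rentals", aggfunc="sum")

# Totals per year across bike types (computed before adding the change column)
totals = pivot[[2018, 2019]].sum()

# Year-over-year change per bike type, in percent
pivot["change_pct"] = 100 * (pivot[2019] - pivot[2018]) / pivot[2018]

print(pivot)
print(totals)
```

In this sample the Standard rentals fall and the Electric rentals rise while the yearly totals stay equal, which is the pattern the paragraph describes.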
Distribution of bike rentals over trip type by the fare type:¶
# create a dataset for bike rentals over each hour in a day by trip type and fare type
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"],
bikeshare["trip_type"],
bikeshare["fare_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Point plot:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'fare_type', height = 3, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and fare type',
fontsize = 15, weight = 'bold')
g.set_titles('Fare = {row_name} | Year = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    # use a separate loop variable so the axis index i is not overwritten
    for loc, label in enumerate(labels):
        # skip labels
        if not (loc%5 == 0): labels[loc] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
    if i in [0, 1, 2]:
        # Change the transparency of both lines
        ax.lines[0].set_alpha(0.4)
        ax.lines[1].set_alpha(0.4)
        ax.set_facecolor('0.97')
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.1.9 Average bike rentals based on hour of the day over years by trip type and fare type.png', dpi=300, bbox_inches='tight')
Observation:
- The above plot shows that customers who pay Extended fares take almost as many Round Trips as One Way trips.
- Customers who pay the Base fare prefer One Way trips.
Reform:
- The above behaviour has a positive end result and does not require any intervention. If more customers took longer One Way rides (fares and ride durations are correlated), the bikes would end up farther from their home stations, which might create a gap between the supply and demand of bikes at these stations. However, since more customers prefer One Way trips for shorter rides, the bikes end up in the same geographical cluster, and customers can easily be redirected to nearby available stations in case of a bike shortage.
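The trip type preference per fare type can be expressed as row-normalized shares. The sketch below uses a few made-up trips; the `Base` and `Extended` labels follow the text above, and everything else is illustrative.

```python
import pandas as pd

# Hypothetical trips (not the bikeshare data)
trips = pd.DataFrame({
    "fare_type": ["Base"] * 5 + ["Extended"] * 4,
    "trip_type": ["One Way"] * 4 + ["Round Trip"]
                 + ["One Way", "One Way", "Round Trip", "Round Trip"],
})

# Share of each trip type within every fare type (each row sums to 1)
share = pd.crosstab(trips["fare_type"], trips["trip_type"], normalize="index")
print(share)
```

A row close to 50/50 corresponds to "no preference", while a row dominated by One Way corresponds to the Base fare behaviour described in the observation.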
- [...] One Way trips compared to Round Trips for bike rental, with a grey area in the early hours of the day, where the average number of bike rentals is very low and not statistically significant for comparison.
- [...] One Way trips decrease during Saturdays and Sundays, while Round Trips experience a slight increase.
- The first half of the year 2019 experiences a relatively low number of One Way bike rentals compared to the later half of the year. This trend is not limited to 2019 but is consistent over the other years.
- Standard bike rentals for One Way trips reduced significantly during 2019 compared to 2018. At the same time, there is a noticeable increase in Electric bike rentals for One Way trips, which shows that customers preferred Electric bikes over Standard bikes for One Way trips in 2019. However, as the Electric bike was introduced towards the end of 2018, the customers that used Standard bikes shifted towards Electric bikes after the second quarter of 2019. The customers' bike preference changed, but the total number of One Way bike rentals did not decrease.
- Customers that pay the Base fare prefer One Way trips, while customers that pay Extended fares take almost the same number of Round Trips as One Way trips and do not exhibit any preference over trip types.
- [...] Bike rallies will significantly increase bike rentals during holidays/weekends.
- [...] One Way trips from stations with a high bike count to stations with a low bike count during the weekdays will normalize the distribution of bikes over all stations.
- Promotions/discounts should be offered on One Way trips over the first half of the year to encourage customers to take more One Way trips.
- [...] Extended fares subjected to One Way trips has a positive end result and does not require any intervention. If more customers took longer One Way rides (fares and ride durations are correlated), the bikes would end up farther from their home stations, which might create a gap between the supply and demand of bikes at these stations. However, as more customers prefer One Way trips for shorter rides (Base fare), the bikes end up in the same geographical cluster, which makes it easier to redirect customers to nearby available stations in case of a bike shortage.
Column: bike_type
Data type: categorical data, nominal
Plot: Bar chart, Point plot, Facet grid
Aggregated distribution of bike rentals based on bike type:¶
# Assign color palette as per requirement
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[2]
# prepare data for the plot
bike_type_order = bikeshare.bike_type.value_counts().index
max_count = bikeshare['bike_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]
# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'bike_type', color = base_color, alpha= 0.5, order = bike_type_order)
# improve plot aesthetics
plt.title('Aggregated rentals based on bike type\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
bike_type_counts = bikeshare['bike_type'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = bike_type_counts[label.get_text()]
    except KeyError:
        count = 0
    pct_string = '{:0.0f}%'.format(100*count/n_points)
    # print the annotation depending on the bar length
    if count < (n_points/20):
        plt.text(loc, count + (n_points/40), pct_string, ha = 'center', color = 'black', fontsize = 13)
    else:
        plt.text(loc, count - (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 13);
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.1 Aggregated distribution of bike rentals based on bike type.png', dpi=300, bbox_inches='tight')
The above plot shows that Standard bikes are in higher demand than Electric and Smart bikes. However, more than 50% of the bike rentals have no bike type label, which makes this comparison unreliable. Also, the calculation is based on bike rentals aggregated over 3 years and requires a deeper analysis segmented by year to uncover any hidden insights.
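The share of rentals without a bike type label can be measured directly, both overall and per year. This is a minimal sketch on made-up rows; it assumes that unlabelled rentals carry the value `"Unknown"` in `bike_type`, which may differ from the actual label in the dataset.

```python
import pandas as pd

# Hypothetical rentals; "Unknown" as the missing-label marker is an assumption
bikes = pd.DataFrame({
    "year": [2017, 2017, 2018, 2018, 2019, 2019, 2019, 2019],
    "bike_type": ["Unknown", "Unknown", "Unknown", "Standard",
                  "Standard", "Electric", "Smart", "Standard"],
})

# Overall share of each bike type, including the unlabelled rentals
overall = bikes["bike_type"].value_counts(normalize=True)

# Share of unlabelled rentals within each year
unknown_by_year = bikes["bike_type"].eq("Unknown").groupby(bikes["year"]).mean()

print(overall)
print(unknown_by_year)
```

Breaking the share down by year shows whether the missing labels are spread evenly or concentrated in particular years.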
Aggregated bike rentals based on bike type over years:¶
# Assign color palette as per requirement
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[2]
# prepare data for the plot
bike_type_order = bikeshare.bike_type.value_counts().index
max_count = bikeshare['bike_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]
# Seaborn's countplot
g = sb.FacetGrid(data = bikeshare, col = 'year', height = 4.5, aspect = 0.8)
g.map(sb.countplot, 'bike_type', color = base_color, alpha= 0.5, order = bike_type_order);
# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Aggregated bike rentals based on bike type over the years', fontsize = 16, weight = 'bold')
g.set_titles('Year = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')
g.set_xlabels('\nBike Type', size = 12)
g.set_ylabels('Bike rentals\n', size = 12)
g.set_xticklabels(size = 12)
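# sort the y_tick_names numerically (a plain string sort would place '10 K' before '2 K') and assign them as new yticks
y_tick_names.sort(key=lambda name: float(name.split()[0]))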
g.set_yticklabels(y_tick_names, size = 12)
# -------------------------------------------------------
# add annotations
# -------------------------------------------------------
total_count = bikeshare.shape[0]
for ax in g.axes.ravel(): # loops over the different figures in the grid
    for i, p in enumerate(ax.patches): # loops over the different bars in each figure
        ax.annotate('{:0.1f}%'.format(100*p.get_height()/total_count), (p.get_x() + p.get_width() / 2., p.get_height()),
                    ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
        # fade out the bars related to unknown bike type
        if i == 0:
            p.set_alpha(0.2);
# ------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.2 Aggregated bike rentals based on bike type over years.png', dpi=300, bbox_inches='tight')
The above plot suggests that the classification of bikes was introduced at some point in 2018, which explains the bikes with the unknown label in the plots for 2017 and 2018. However, the point in 2018 at which the classification was introduced is not clear, and further analysis is required to decide whether to include or exclude the 2018 rentals in the subsequent analysis.
Average bike rentals based on month over years by trip type and bike type:¶
# create a dataset for bike rentals over each month for all years
months_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["trip_type"],
bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
months_df.head(10)
Facet grid:
def plot_rectangle(ax, x, y, width, height):
    '''plots rectangular patch in the specified axis'''
    ax.add_patch(
        patches.Rectangle(
            (x,y),
            width,
            height,
            # You can add rotation with 'angle'
            alpha=0.25, facecolor="gold", edgecolor="gold", linewidth=1, linestyle='solid'
        )
    )
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#ff7ddd', '#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=4, desat=0.6)
# Facet grid with point plot
plot_order = months_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = months_df, col = 'year', row = 'trip_type', height = 4, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on month over years by trip type and bike type',
fontsize = 16, weight = 'bold')
g.set_titles('Trip Type = {row_name} | Year = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
y_tick_names = []
for i, ax in enumerate(g.axes.flat):
    # get y labels
    for y_label in ax.get_yticklabels():
        y_label_value = int(y_label.get_text().replace('−','-'))
        y_label_new = '{:0.0f} K'.format(y_label_value/1000)
        if y_label_new not in y_tick_names:
            y_tick_names.append(y_label_new)
    # get x labels
    xlabels = ax.get_xticklabels()
    ax.set_xticklabels(xlabels, size = 12)
    if i not in [0, 3]:
        sb.despine(left=True, ax = ax)
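# sort the y_tick_names numerically (a plain string sort would place '10 K' before '2 K') and assign them as new yticks
y_tick_names.sort(key=lambda name: float(name.split()[0]))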
g.set_yticklabels(y_tick_names, size = 12)
g.set_xlabels('\nMonth of the year', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[3], linestyle='-', linewidth = 2)]
labels = months_df.bike_type.sort_values(ascending=True).unique()
plt.legend(custom, labels, scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 0.9));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# Add rectangles to highlight the area of interest
# -------------------------------------------------------
ax1 = g.facet_axis(0,1)
ax2 = g.facet_axis(1,1)
plot_rectangle(ax = ax1, x=8, y=0, width = 3, height = 30000)
plot_rectangle(ax = ax2, x=8, y=0, width = 3, height = 10000);
# -------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.3 Average bike rentals based on month over years by trip type and bike type.png', dpi=300, bbox_inches='tight')
The yellow highlights in the above plots show that the classification of bikes was introduced towards the end of 2018. Hence, the 2018 rentals in the unknown bike category can be ignored, and the analysis in the following plots is limited mostly to the year 2019 for clearer insights.
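The point at which labelled bike types first appear can also be located without reading it off the plot. The following is a minimal sketch on made-up monthly rows; it assumes month numbers and an `"Unknown"` marker for unlabelled rentals, neither of which is confirmed by the notebook.

```python
import pandas as pd

# Hypothetical monthly rows (not the bikeshare data)
months = pd.DataFrame({
    "year":      [2018, 2018, 2018, 2018, 2019],
    "month":     [8, 9, 10, 11, 1],
    "bike_type": ["Unknown", "Unknown", "Standard", "Electric", "Standard"],
    "rentals":   [30000, 28000, 12000, 3000, 20000],
})

# Keep only rows with a known bike type and take the earliest one
labelled = months[months["bike_type"] != "Unknown"]
first = labelled.sort_values(["year", "month"]).iloc[0]

print(int(first["year"]), int(first["month"]))
```

Rentals before that year and month can then be treated as unclassified when restricting the analysis.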
Average rentals based on the daytime:¶
month_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
# month_df['rentals'] = month_df['rentals'].fillna(0).astype(int)
# copy the filtered frame so the column assignment below does not act on a view
month_df = month_df.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
month_df['bike_type'] = month_df['bike_type'].astype(ordered_cat)
month_df.head(10)
Point plot:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
base_color = sb.color_palette()[0]
# Seaborn's pointplot
ax = sb.pointplot(data = month_df, x = "month", y = "rentals", linestyles = "-", hue = 'bike_type',
scale = 1, ci = None)
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average monthly bike rentals categorized by bike type in 2019\n\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals (thousands)\n', fontsize = 14)
plt.xlabel('\nMonth of the year', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
monthly_rental_avg_max = locs.max()
y_tick_values = np.arange(0, monthly_rental_avg_max+5000, 5000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1,
framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1))
# plot custom axial grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5)
# add ellipse
# -------------------------------------------------------
ax.add_patch(
patches.Ellipse(
(8, 12000), # (x,y)
7, # width
12000, # height
angle=0, # rotation angle in degrees (not a radius)
alpha=0.2, facecolor="gold", edgecolor="gold", linewidth=1, linestyle='solid'
)
);
# -------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.4 Average monthly bike rentals categorized by bike type in 2019.png', dpi=300, bbox_inches='tight')
Observation:
The above plot shows that rentals of Standard bikes decrease over the year 2019, while rentals of Smart and Electric bikes increase over the same period. Hence, even though Standard bikes were popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of the year. It can be concluded that the launch of Electric and Smart bikes was a success.
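The shift described in the observation can be summarized as the change in each bike type's share of monthly rentals between the start and the end of the year. This is a minimal sketch on made-up numbers, assuming numeric months and the column names used in `month_df`.

```python
import pandas as pd

# Hypothetical monthly rentals for January and December (illustrative only)
month_df = pd.DataFrame({
    "month":     [1, 1, 1, 12, 12, 12],
    "bike_type": ["Standard", "Electric", "Smart"] * 2,
    "rentals":   [20000, 4000, 1000, 8000, 10000, 7000],
})

# One row per month, one column per bike type
wide = month_df.pivot(index="month", columns="bike_type", values="rentals")

# Share of each bike type within a month (each row sums to 1)
share = wide.div(wide.sum(axis=1), axis=0)

# Positive values mean the bike type gained share over the year
change = share.loc[12] - share.loc[1]
print(change)
```

Working with shares rather than raw counts keeps the comparison valid even when the overall number of rentals differs between months.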
Let us take a look at the other factors that influence the bike type over time:
Average hourly bike rentals categorized by bike type in 2019:¶
# create a dataset for bike rentals over each hour in a day in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
hours_df = temp_df.groupby([temp_df["month"],
temp_df["day"],
temp_df["hour"],
temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Point plot:
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
base_color = sb.color_palette()[0]
# Seaborn's pointplot
ax = sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "-", hue = 'bike_type',
scale = 1, ci = None)
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average hourly bike rentals categorized by bike type in 2019\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day (year 2019)', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1,
framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.25, 1));
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.5 Average hourly bike rentals categorized by bike type in 2019.png', dpi=300, bbox_inches='tight')
- [...] Standard bike between 7:00 AM and 9:00 AM, and between 3:00 PM and 5:00 PM, which are office reporting and leaving times. This conveys that working individuals preferred Standard bikes to ride to their work locations and back home after work.
- [...] Electric bikes are as popular as Standard bikes over time, they are particularly preferred between 7:00 PM and 12:00 AM.
Is the trend influenced by other variables? Let us take a deeper look for any hidden trends.
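The peak hours read from the plot can be cross-checked by computing the hour with the highest average rentals for each bike type. The sketch below uses made-up counts for three hours on two days; the column names follow `hours_df`, and the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical hourly counts for two days (not the bikeshare data)
hourly = pd.DataFrame({
    "day":       [1] * 3 + [2] * 3 + [1] * 3 + [2] * 3,
    "hour":      [8, 17, 21] * 4,
    "bike_type": ["Standard"] * 6 + ["Electric"] * 6,
    "rentals":   [40, 45, 20, 44, 47, 22, 15, 18, 25, 17, 20, 29],
})

# Average rentals per bike type and hour, averaged over days
avg = hourly.groupby(["bike_type", "hour"])["rentals"].mean()

# Hours as columns, then pick the hour with the highest average per bike type
peak_hour = avg.unstack("hour").idxmax(axis=1)
print(peak_hour)
```

With real data the same call returns a single peak hour per bike type, so time windows such as the commute hours would need a threshold on the averages instead of `idxmax`.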
Average hourly bike rentals categorized by bike type and trip type in 2019:¶
# create a dataset for bike rentals over each hour in the day by trip type and bike type in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
hours_df = temp_df.groupby([temp_df["month"],
temp_df["day"],
temp_df["hour"],
temp_df["trip_type"],
temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Facet grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'trip_type', col_wrap = 3, height = 3.5, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average hourly bike rentals by bike type and trip type in 2019', fontsize = 15, weight = 'bold')
g.set_titles('Trip = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for i, ax in enumerate(g.axes.flat):
    # get x labels
    labels = ax.get_xticklabels()
    # use a separate loop variable so the axis index i is not overwritten
    for loc, label in enumerate(labels):
        # skip labels
        if not (loc%5 == 0): labels[loc] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------
# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
# ax1.lines[1].set_alpha(0.4)
# ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
# ax2.lines[0].set_alpha(0.4)
# ax2.lines[2].set_alpha(0.4)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.5, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.6 Average hourly bike rentals by bike type and trip type in 2019.png', dpi=300, bbox_inches='tight')
It appears that the trend is consistent for One Way trips, while customers who take Round Trips do not show any preference between bikes. However, the plot is based on all rentals aggregated over the year 2019. Is the same trend consistent throughout 2019? Let us calculate the average bike rentals over the quarters of 2019 for deeper insights.
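A yearly aggregate can hide a change of preference within the year, which is why a per-quarter breakdown is useful. The sketch below uses made-up numbers and derives the quarter from a numeric month; how the notebook's own `quarter` column was built is not shown, so this derivation is an assumption.

```python
import pandas as pd

# Hypothetical monthly rentals per bike type (illustrative only)
trips = pd.DataFrame({
    "month":     [2, 2, 5, 5, 8, 8, 11, 11],
    "bike_type": ["Standard", "Electric"] * 4,
    "rentals":   [900, 100, 700, 300, 450, 550, 300, 700],
})

# Months 1-3 map to quarter 1, 4-6 to quarter 2, and so on
trips["quarter"] = (trips["month"] - 1) // 3 + 1

# Rentals per quarter and bike type
by_quarter = trips.pivot_table(index="quarter", columns="bike_type",
                               values="rentals", aggfunc="sum")

# Most rented bike type per quarter versus over the whole year
leader_by_quarter = by_quarter.idxmax(axis=1)
leader_overall = by_quarter.sum().idxmax()

print(leader_by_quarter)
print(leader_overall)
```

In this sample the whole-year leader is Standard even though Electric leads in the last two quarters, which is the kind of effect the quarterly plot is meant to reveal.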
Average hourly bike rentals categorized by bike type and trip type over the quarters of 2019:
# create a dataset for bike rentals over each hour in the day by trip type and bike type over quarters in 2019
temp_df = bikeshare.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
hours_df = temp_df.groupby([temp_df["quarter"],
temp_df["month"],
temp_df["day"],
temp_df["hour"],
temp_df["trip_type"],
temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Facet Grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'quarter', row = 'trip_type', height = 3, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day by bike type and trip type over quarters of 2019',
fontsize = 15, weight = 'bold')
g.set_titles('Trip = {row_name} | Quarter = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    for i, l in enumerate(labels):
        # keep every 5th label and blank out the rest
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------
# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax1.lines[1].set_alpha(0.4)
ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
ax2.lines[2].set_alpha(0.4)
ax3 = g.facet_axis(0,2)
ax3.lines[0].set_alpha(0.4)
ax3.lines[2].set_alpha(0.4)
ax4 = g.facet_axis(0,3)
ax4.lines[0].set_alpha(0.4)
ax4.lines[2].set_alpha(0.4)
ax5 = g.facet_axis(1,0)
ax6 = g.facet_axis(1,1)
ax7 = g.facet_axis(1,2)
ax8 = g.facet_axis(1,3)
for ax in [ax5, ax6, ax7, ax8]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.7 Average hourly bike rentals by bike type and trip type in quarter of 2019.png', dpi=300, bbox_inches='tight')
Observations:
- Even though Standard bikes are the most popular choice during the first quarter of the year, Electric bikes gradually gained popularity among One Way trips over the rest of the year.
- Customers who take Round Trips do not show any preference among bike types.
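One way to put numbers on a shift in preference like the one observed above is a row-normalised crosstab, which gives the share of each bike type within a quarter. The sketch below uses invented One Way trips; the shares it prints are not results from the bikeshare data.

```python
import pandas as pd

# Hypothetical One Way trips; values are invented for illustration only.
trips = pd.DataFrame({
    "quarter":   [1, 1, 1, 1, 4, 4, 4, 4],
    "bike_type": ["Standard", "Standard", "Standard", "Electric",
                  "Standard", "Electric", "Electric", "Electric"],
})

# Row-normalised crosstab: each row sums to 1, giving the share of each
# bike type within a quarter.
share = pd.crosstab(trips["quarter"], trips["bike_type"], normalize="index")
print(share.round(2))
```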
Average hourly bike rentals categorized by bike type and pass type in 2019:
# create a dataset for bike rentals over each hour in the day by pass type and bike type in the year 2019
temp_df = bikeshare.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)
hours_df = temp_df.groupby([temp_df["month"],
temp_df["day"],
temp_df["hour"],
temp_df["pass_type"],
temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
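One detail worth checking in the grouping above: when a grouping column is categorical, pandas can return a row for every combination of group values, including combinations that never occurred, with a count of 0. Whether that happens depends on the `observed` argument of `groupby`, whose default differs between pandas versions. The sketch below, on invented data, passes `observed` explicitly to show both behaviours.

```python
import pandas as pd

bike_dtype = pd.api.types.CategoricalDtype(
    categories=["Standard", "Electric", "Smart"], ordered=True)

# Hypothetical toy trips: there are no 'Smart' rentals at all.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3],
    "hour": [8, 8, 9],
    "bike_type": pd.Series(["Standard", "Standard", "Electric"]).astype(bike_dtype),
})

# observed=False keeps every hour x category combination and fills the
# combinations that never occurred with a count of 0.
all_combos = trips.groupby(["hour", "bike_type"], observed=False)["trip_id"].count()

# observed=True keeps only the combinations present in the data.
seen_only = trips.groupby(["hour", "bike_type"], observed=True)["trip_id"].count()

print(len(all_combos), len(seen_only))
```

If zero-count rows are present in `hours_df`, they enter the mean that the point plot draws and pull the plotted averages down, so it may be worth deciding deliberately which of the two behaviours is wanted.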
Facet Grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'pass_type', col_wrap = 3, height = 3.5, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.75)
g.fig.suptitle('Average hourly bike rentals by bike type and pass type in 2019', fontsize = 15, weight = 'bold')
g.set_titles('Pass = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    for i, l in enumerate(labels):
        # keep every 5th label and blank out the rest
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
g.set_yticklabels(size = 12)
g.set_xlabels('\nHour of the day', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
# -------------------------------------------------------
# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax1.lines[1].set_alpha(0.4)
ax1.lines[2].set_alpha(0.4)
ax2 = g.facet_axis(0,1)
ax2.lines[0].set_alpha(0.4)
ax2.lines[2].set_alpha(0.4)
ax3 = g.facet_axis(0,2)
ax3.lines[0].set_alpha(0.4)
ax3.lines[1].set_alpha(0.4)
ax3.lines[2].set_alpha(0.4)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.5, 1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.8 Average hourly bike rentals by bike type and pass type in 2019.png', dpi=300, bbox_inches='tight')
The above plot shows that customers with a One Day pass prefer Standard bikes, while customers with a Monthly pass prefer Electric bikes. As the number of bike rentals made with an Annual pass is very low, the bike preference of customers with an Annual pass is not evaluated.
However, the plot is based on all rentals aggregated over the year 2019. Is the same trend consistent throughout the year? Let us calculate the average bike rentals over the quarters of 2019 for deeper insights.
Average hourly bike rentals categorized by bike type and pass type over the quarters of 2019:
# create a dataset for bike rentals over each hour in the day by pass type and bike type over quarters in 2019
temp_df = bikeshare.query(' year == 2019 ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)
hours_df = temp_df.groupby([temp_df["quarter"],
temp_df["month"],
temp_df["day"],
temp_df["hour"],
temp_df["pass_type"],
temp_df["bike_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Facet grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#4b99eb', '#77f7cc', '#aa75fa']
sb.set_palette(flatui, n_colors=3, desat=0.6)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'quarter', row = 'pass_type', height = 3, aspect = 1, hue = 'bike_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average bike rentals based on hour of the day by bike type and pass type over quarters of 2019',
fontsize = 15, weight = 'bold')
g.set_titles('Pass = {row_name} | Quarter = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day (year 2019)', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    for i, l in enumerate(labels):
        # keep every 5th label and blank out the rest
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------
# set transparency of the redundant lines in each axis
# -------------------------------------------------------
ax1 = g.facet_axis(0,0)
ax2 = g.facet_axis(0,1)
ax3 = g.facet_axis(0,2)
ax4 = g.facet_axis(0,3)
for ax in [ax1, ax2, ax3, ax4]:
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
ax5 = g.facet_axis(1,0)
ax6 = g.facet_axis(1,1)
ax7 = g.facet_axis(1,2)
ax8 = g.facet_axis(1,3)
for ax in [ax5, ax6, ax7, ax8]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
ax9 = g.facet_axis(2,0)
ax10 = g.facet_axis(2,1)
ax11 = g.facet_axis(2,2)
ax12 = g.facet_axis(2,3)
for ax in [ax9, ax10, ax11, ax12]:
    ax.lines[0].set_alpha(0.4)
    ax.lines[1].set_alpha(0.4)
    ax.lines[2].set_alpha(0.4)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['Standard', 'Electric', 'Smart'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 3, title='Bike Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-0.4, 4.3));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.2.9 Average hourly bike rentals by bike type and pass type in quarter of 2019.png', dpi=300, bbox_inches='tight')
Observations:
- Even though Standard bikes are the most popular choice for customers with a One Day pass during the first quarter of the year, the number of rentals made with a One Day pass decreased to the point that there is no significant difference in preference between Standard bikes and Smart bikes towards the end of 2019.
- Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019.
- As the number of bike rentals made with an Annual pass is very low, the bike preference of customers with an Annual pass is not evaluated.
- unknown bike category are ignored and the analysis is limited to the year 2019.
- The rentals for the Standard bike type decrease over the year 2019, while the rentals for the Smart and Electric bike types increase over the same period. Hence, even though Standard bikes are popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of the year. It can therefore be concluded that the launch of Electric and Smart bikes is a success.
- Even though Standard bikes are the most popular choice during the first quarter of the year, Electric bikes gradually gained popularity among One Way trips over the rest of the year.
- Customers who take Round Trips do not show any preference among bike types.
- Even though Standard bikes are the most popular choice for customers with a One Day pass during the first quarter of the year, the number of rentals made with a One Day pass decreased to the point that there is no significant difference in preference between Standard bikes and Smart bikes towards the end of 2019.
- Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019.
- Although Smart bikes were introduced along with the Electric bikes, they failed to gain as much popularity as Electric bikes. Hence discounts should be announced to increase the rental activity of Smart bikes during peak hours, which in turn helps the stations to maintain the availability of other bike types.
- Smart bikes in promotional events like bike rallies to familiarize customers with their features and encourage customers to prefer Smart bikes in the future.
Column: pass_type
Data type: categorical data, nominal
Plot: Bar chart, Point plot, Facet grid
Aggregated distribution of bike rentals based on pass type:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[6]
# prepare data for the plot
pass_type_order = bikeshare.pass_type.value_counts().index
max_count = bikeshare['pass_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]
# Seaborn's countplot
ax = sb.countplot(data = bikeshare, x = 'pass_type', color = base_color, alpha= 1,
order = pass_type_order, saturation = 0.5)
# improve plot aesthetics
plt.title('Aggregated bike rentals based on customer pass\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
pass_type_counts = bikeshare['pass_type'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = pass_type_counts[label.get_text()]
    except KeyError:
        count = 0
    count_percent = 100*count/n_points
    # shares below 1% would otherwise be rounded down and printed as '0%'
    if count_percent < 1:
        pct_string = '< 1%'
    else:
        pct_string = '{:0.0f}%'.format(count_percent)
    # print the annotation depending on the bar length
    if count < (n_points/20):
        plt.text(loc, count + (n_points/30), pct_string, ha = 'center', color = 'black', fontsize = 12)
    else:
        plt.text(loc, count - (n_points/25), pct_string, ha = 'center', color = 'black', fontsize = 12);
# -------------------------------------------------------
# loops over the different bars in each figure and
# fade out the bars other than highest rental pass
for i, p in enumerate(ax.patches):
    if i != 0:
        p.set_alpha(0.6);
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.1 Aggregated distribution of bike rentals based over pass type.png', dpi=300, bbox_inches='tight')
The above plot shows that the Monthly pass is the most popular subscription among customers. However, the calculation is based on bike rentals aggregated over 3 years and requires a deeper analysis segmented by year to reveal any hidden insights.
Aggregated yearly rentals based on pass type:
# create a dataset for bike rentals over the years by pass type
categorical_counts = bikeshare.groupby([bikeshare['pass_type'],
bikeshare['year']]).size().reset_index(name='rentals')
categorical_counts.head(10)
Line plot:
# set the palette as per requirement
sb.set_style('whitegrid')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=0.6)
plt.figure(figsize = [6, 4])
max_count = categorical_counts.rentals.max()
y_tick_values = np.arange(0, max_count + 50000, 50000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
# plot line plot
ax = sb.lineplot(data=categorical_counts, x = "year", y = "rentals", hue="pass_type", linewidth=3, alpha = 0.8,
style="pass_type", err_style="bars", markers = ['o', 'o', 'o', 'o', 'o'], markersize=10)
ax.lines[0].set_linestyle("-")
ax.lines[1].set_linestyle("-")
ax.lines[2].set_linestyle("-")
ax.lines[3].set_linestyle("-")
ax.lines[4].set_linestyle("-")
plt.title('Aggregated yearly rentals based on pass type\n', weight = 'bold', fontsize = 16)
plt.yticks(y_tick_values, y_tick_names, fontsize=12)
plt.xlabel('\nYear', fontsize=14)
plt.ylabel('Rentals (Thousands)\n', fontsize=14)
# set custom xticks to avoid segmentation of continuous values of year (['2017.25', '2017.50', '2017.75, ....'])
# get alloted xticks, if decimal part equals to 0, set it as xtick value else skip the xtick value
xlabels = ['{:.0f}'.format(x) if divmod(x, 1)[1] == 0 else "" for x in ax.get_xticks()]
ax.set_xticklabels(xlabels)
plt.xticks(fontsize=12)
# customize legend
leg = ax.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Pass Type', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True,
handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.35, 1))
leg_lines = leg.get_lines()
leg_lines[1].set_linestyle("-")
leg_lines[2].set_linestyle("-")
leg_lines[3].set_linestyle("-")
leg_lines[4].set_linestyle("-")
leg.texts[0].set_text("");
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.2 Aggregated yearly rentals based on pass type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that the Monthly pass has always been the most popular choice for customers, and the discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals made with the Monthly pass. The Flex pass is an experimental introduction with too few bike rentals to include in further analysis.
- There is a slight increase in the rentals made with the Annual pass in the year 2019.
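A statement such as "a slight increase" can be made concrete with a year-over-year percent change per pass type. The sketch below is a minimal example on invented yearly totals shaped like `categorical_counts`; the column name `yoy_change_pct` is an assumption introduced here.

```python
import pandas as pd

# Hypothetical yearly totals for one pass type; the numbers are invented.
yearly = pd.DataFrame({
    "pass_type": ["Annual", "Annual", "Annual"],
    "year": [2017, 2018, 2019],
    "rentals": [1000, 1000, 1100],
})

# Percent change relative to the previous year, computed per pass type.
yearly = yearly.sort_values(["pass_type", "year"])
yearly["yoy_change_pct"] = yearly.groupby("pass_type")["rentals"].pct_change() * 100
print(yearly)
```

The first year of each pass type has no previous year to compare against, so its change is `NaN`.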
Let us take a look at the other factors that influence the pass type over time:
Average hourly rentals based on pass type and trip type:
# create a dataset for bike rentals over each hour in a day by pass type and trip type
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"],
bikeshare["trip_type"],
bikeshare["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
Facet grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=None)
# Facet grid with point plot
plot_order = hours_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'year', row = 'trip_type', height = 3, aspect = 1, hue = 'pass_type')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average hourly rentals based on pass type and trip type', fontsize = 16, weight = 'bold')
g.set_titles('Trip = {row_name} | Year = {col_name}', weight = 'bold', size = 13, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nMonth of the year', size = 13)
g.set_ylabels('Avg. bike rentals\n', size = 13)
g.set_yticklabels(size = 10)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    # for i,l in enumerate(labels):
    # skip labels
    # if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 10)
ax1 = g.facet_axis(0,1)
ax2 = g.facet_axis(0,2)
ax3 = g.facet_axis(1,1)
ax4 = g.facet_axis(1,2)
for ax in [ax1, ax2, ax3, ax4]:
    sb.despine(left = True, ax = ax)
for i, ax in enumerate(g.axes.flat):
    if i in [0, 1, 2]:
        ax.lines[0].set_alpha(0.3)
        ax.lines[1].set_alpha(0.3)
        ax.lines[3].set_alpha(0.3)
        ax.lines[4].set_alpha(0.3)
    if i in [3, 4, 5]:
        ax.lines[0].set_alpha(0.3)
        ax.lines[2].set_alpha(0.3)
        ax.lines[3].set_alpha(0.3)
        ax.lines[4].set_alpha(0.3)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[3], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[4], linestyle='-', linewidth = 2)]
labels = hours_df.pass_type.sort_values(ascending=True).unique()
plt.legend(custom, labels, scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Pass Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.6, 2.4));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.3.3 Average hourly rentals based on pass type and trip type.png', dpi=300, bbox_inches='tight')
Observations:
- The majority of One Way trips are taken by customers with a Monthly subscription.
- Rentals on the One Day pass show a steady decrease for One Way trips over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019.
- The majority of Round Trips are taken by customers with a One Day subscription.
- The Monthly pass has always been the most popular choice for customers, and the discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals made with the Monthly pass.
- There is a slight increase in the rentals made with the Annual pass in the year 2019.
- The majority of One Way trips are taken by customers with a Monthly subscription.
- Rentals on the One Day subscription showed a steady decrease for One Way trips over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019.
- The majority of Round Trips are taken by customers with a One Day subscription.
- One Day subscription to encourage tourists and non-subscribers to rent a bike.
Column: fare_type, fare
Data type: (categorical data, nominal), (numerical, continuous)
Plot: Bar chart, Count plot, Facet grid, Point plot
Aggregated bike rentals based on fare type:
# Assign color palette as per requirement
sb.set_style("white")
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[8]
# prepare data for the plot
fare_type_order = bikeshare.fare_type.value_counts().index
max_count = bikeshare['fare_type'].value_counts().max()
tick_values = np.arange(0, max_count + 100000, 100000)
tick_names = ['{:0.1f} M'.format(v/1000000) for v in tick_values]
# Seaborn's countplot
sb.countplot(data = bikeshare, x = 'fare_type', color = base_color, alpha= 0.6,
order = fare_type_order, saturation = 1)
# improve plot aesthetics
plt.title('Aggregated bike rentals based on fare type\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare type', fontsize = 14)
plt.ylabel('Bike rentals (million)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
fare_type_counts = bikeshare['fare_type'].value_counts()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        # get the text property for the label to get the correct count
        count = fare_type_counts[label.get_text()]
    except KeyError:
        count = 0
    pct_string = '{:0.0f}%'.format(100*count/n_points)
    # print the annotation depending on the bar length
    if count < (n_points/10):
        plt.text(loc, count + (n_points/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
    else:
        plt.text(loc, count - (n_points/10), pct_string, ha = 'center', color = 'black', fontsize = 14);
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.1 Aggregated bike rentals based on fare type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that the majority of customers use the Base fare option to reach their destinations.
- A decrease in the percentage of Extended fares will result in a decrease in income. As the percentage of Extended fares is less than 20%, some business reforms or promotional programs are needed to encourage customers to ride bikes for longer durations.
However, not all Base fares are free for the first 30 minutes. Unlike other pass types, the Walk-up pass charges a fare of 1 dollar for the Base fare type.
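A quick way to check a rule like this against the data is to look at the range of fares charged for Base trips within each pass type. The sketch below does that on invented rows that simply mirror the rule stated above; the fare values are assumptions, not figures taken from the dataset.

```python
import pandas as pd

# Hypothetical trips; fares are invented to mirror the rule described above.
trips = pd.DataFrame({
    "pass_type": ["Monthly", "Monthly", "Walk-up", "Walk-up", "One Day"],
    "fare_type": ["Base", "Extended", "Base", "Base", "Base"],
    "fare":      [0.0, 1.75, 1.0, 1.0, 0.0],
})

# Range of fares charged for Base trips, per pass type.
base_fares = (trips[trips["fare_type"] == "Base"]
              .groupby("pass_type")["fare"]
              .agg(["min", "max"]))
print(base_fares)
```

Running the same aggregation on `bikeshare` would show which pass types pay a non-zero Base fare.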
Calculation of Income distribution of trip fares:
# compute the descriptive statistics of trip fares
bikeshare['fare'].describe()
Break down the trip fares into customized sections based on the descriptive statistics of trip fares.
# divide the fare into customized sections
# (named bins rather than bin, to avoid shadowing the built-in bin function)
bins = [-1,0,5,10,50,100,600]
# use the pd.cut function to assign the values to their specific bins
fare = pd.cut(bikeshare['fare'], bins)
fare = fare.to_frame()
fare.columns = ['fare_sections']
fare.sample(10)
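To make the binning above easier to read, here is a small sketch of how `pd.cut` assigns values to these bin edges. By default each interval is open on the left and closed on the right, which is why the first edge is -1: it lets a fare of exactly 0 fall into its own section. The sample fares are invented.

```python
import pandas as pd

bins = [-1, 0, 5, 10, 50, 100, 600]
fares = pd.Series([0, 0.5, 5, 5.01, 600])

# Each value is assigned to the interval (left, right] that contains it.
sections = pd.cut(fares, bins)
print(sections.astype(str).tolist())

# Position of each assigned interval in the ordered list of bins.
codes = sections.cat.codes.tolist()
print(codes)
```

A fare of exactly 5 lands in the second section rather than the third, because the right edge of an interval is inclusive.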
Count plot:
# Assign palette as per requirement
sb.set_palette('colorblind', n_colors=10, desat = 0.8)
base_color = sb.color_palette()[8]
# Seaborn's count plot
sb.countplot(data = fare, x = 'fare_sections', color = base_color, alpha= 0.8, saturation = 1)
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Income distribution of trip fares\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare (Dollars)', fontsize = 14)
plt.ylabel('Rentals (million)\n', fontsize = 14)
# obtain y_ticks and convert them to a multiple of millions
y_tick_locs = []
locs, labels = plt.yticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    y_tick_locs.append(int(loc))
y_tick_names = ['{:0.1f} M'.format(loc/1000000) for loc in y_tick_locs]
plt.yticks(y_tick_locs, y_tick_names, fontsize=12)
# assigning xticks here will interfere with annotations
# -------------------------------------------------------
# add annotations
# -------------------------------------------------------
n_points = fare.shape[0]
fare_counts = fare.fare_sections.value_counts()
fare_max = fare_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # parse the upper bound of the bin from a tick label such as '(5, 10]'
    upper = int(label.get_text().split(',')[1].strip(' ]'))
    # look up the count of the bin (an Interval) that contains this upper bound
    count = next((c for interval, c in fare_counts.items() if upper in interval), 0)
    if (100*count/n_points) < 0.1:
        pct_string = '< 0.1%'
    else:
        pct_string = '{:0.1f}%'.format(100*count/n_points)
    # print the annotation depending on the bar length
    if count < (fare_max/10):
        plt.text(loc, count+(fare_max/25), pct_string, ha = 'center', color = 'black', weight = 'normal', fontsize = 12)
    else:
        plt.text(loc, count-(fare_max/10), pct_string, ha = 'center', color = 'black', weight = 'normal', fontsize = 12)
# -------------------------------------------------------
# get xticks and change the first categorical label to just zero dollars
x_labels_new = ['[0]']
# get the current tick locations and labels
x_locs, x_labels = plt.xticks()
for x_label in x_labels[1:]:
    x_labels_new.append(x_label.get_text())
plt.xticks(x_locs, x_labels_new, fontsize=12)
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.2 Income distribution of trip fares.png', dpi=300, bbox_inches='tight')
Observations:
- It is evident that the majority of customers use the Base fare option to reach their destinations.
- Around 25.5% of the bike rentals generate extra income in the form of Extended fares, which reflects a healthy business model.
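As a rough cross-check of a share like the one above, the fraction of rentals with a non-zero fare can be computed directly, without going through the bins. The sketch below uses invented fares, so the percentage it prints is not a result from the bikeshare data.

```python
import pandas as pd

# Hypothetical fares in dollars; invented for illustration.
fares = pd.Series([0.0, 0.0, 0.0, 1.75, 0.0, 3.5, 0.0, 0.0])

# The mean of a boolean mask is the fraction of True values.
paid_share = (fares > 0).mean() * 100
print(f"{paid_share:.1f}% of rentals generated a fare")
```

Note that a non-zero fare is not necessarily an Extended fare: as stated earlier, Walk-up Base trips are also charged, so the two shares can differ.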
Average monthly rentals over years by fare type:
# create a dataset for monthly rentals over years by fare type
month_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["fare_type"]]).count()['trip_id'].reset_index(name='rentals')
month_df.head(10)
Facet Grid:
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Seaborn's point plot
plot_order = month_df.month.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = month_df, col = 'fare_type', col_wrap = 2, height = 4, aspect = 1.2, hue = 'year')
g.map(sb.pointplot, "month", "rentals", order= plot_order, linestyles = "-", ci = None);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average monthly rentals over years by fare type', fontsize = 16, weight = 'bold')
g.set_titles('{col_name}', weight = 'bold', size = 14, color = 'dimgrey')
# improve plot aesthetics
y_tick_names = []
# the facets share the y axis, so read the tick labels from the first axis only
for y_label in g.axes.flat[0].get_yticklabels():
    # matplotlib renders negative numbers with a unicode minus sign
    y_label_value = int(y_label.get_text().replace('−','-'))
    y_label_new_value = '{:0.0f} K'.format(y_label_value/1000)
    y_tick_names.append(y_label_new_value)
g.set_xticklabels(plot_order, size = 12)
g.set_yticklabels(y_tick_names, size = 12)
g.set_xlabels('\nMonth of the year', size = 14)
g.set_ylabels('Avg. Bike rentals\n', size = 14)
plt.subplots_adjust(wspace=0.05, hspace=0.2);
# add custom legend
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Year', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1.1));
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.4.3 Average monthly rentals over years by fare type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that, even though Base fare is the most popular choice among customers, the average number of Base fare rentals in the first 6 months of 2019 is very low.
- The rentals with the Extended fare type have also decreased in 2019 compared to the previous year.
25.5% of the bike rentals generated extra income in the form of Extended fares, which portrays a good business model. However, the average number of bike rentals subject to the Extended fare in 2019 is lower than in 2018 and needs to be increased by adopting new rental techniques that encourage customers to ride the bikes for longer durations.
Columns: trip_type, bike_type, pass_type
Data type: categorical data, nominal
Plot: Bar chart
Bar Chart:
def count_subplot(subplot, color, cat_type, alpha, sat):
    # plot the distribution of bike rentals based on category types
    #-----------------------Start of subplot-----------------------
    # prepare the data for the plot
    sb.set_style('darkgrid')
    base_color = sb.color_palette()[color]
    plt.subplot(1, 4, subplot)
    max_count = bikeshare.shape[0]
    y_tick_values = np.arange(0, max_count + 100000, 100000)
    y_tick_names = ['{:0.1f} M'.format(v/1000000) for v in y_tick_values]
    cat_order = bikeshare[cat_type].value_counts().index
    # plot countplot
    sb.countplot(data = bikeshare, x = cat_type, color = base_color, alpha= alpha, order = cat_order, saturation = sat)
    # improve plot aesthetics
    plt.title('Rentals based on {} type'.format(cat_type[0: 4].title()), fontsize = 16, weight = 'bold')
    plt.xlabel('\n{} type'.format(cat_type[0: 4].title()), fontsize = 14)
    plt.xticks(fontsize = 12)
    # only the first subplot carries the y-axis label and tick names
    if subplot == 1:
        plt.ylabel('Rentals (million)\n', fontsize = 14)
        plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
    else:
        plt.ylabel('')
        plt.yticks(y_tick_values, [])
    # add annotations
    # -------------------------------------------------------
    n_points = bikeshare.shape[0]
    cat_type_counts = bikeshare[cat_type].value_counts()
    # get the current tick locations and labels
    locs, labels = plt.xticks()
    # loop through each pair of locations and labels
    for loc, label in zip(locs, labels):
        try:
            # get the text property for the label to get the correct count
            count = cat_type_counts[label.get_text()]
        except KeyError:
            count = 0
        pct_string = '{:0.0f}%'.format(100*count/n_points)
        # print the annotation above short bars and inside long bars
        if count < (n_points/10):
            plt.text(loc, count + (n_points/25), pct_string, ha = 'center', color = 'black', fontsize = 13)
        else:
            plt.text(loc, count - (n_points/15), pct_string, ha = 'center', color = 'black', fontsize = 13);
    # -------------------------------------------------------
    #-------------------------End of subplot------------------------
# Assign color palette and figure size as per requirement
plt.figure(figsize = [20, 6])
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[0]
# plot syntax : count_subplot(subplot, color, cat_type, alpha, sat)
count_subplot(subplot=1, color=0, cat_type='trip_type', alpha=0.5, sat=1)
count_subplot(subplot=2, color=2, cat_type='bike_type', alpha=0.5, sat=1)
count_subplot(subplot=3, color=6, cat_type='pass_type', alpha=0.6, sat=0.8)
count_subplot(subplot=4, color=8, cat_type='fare_type', alpha=0.6, sat=1)
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.5 Comparision of bike rentals based on various categorical parameters.png', dpi=300, bbox_inches='tight')
standard bike over smart bikes, takes more One Way trips than Round Trips, and prefers Monthly Pass over other subscriptions.
Column: hour
Data type: continuous data
Plot: Distribution plot, Line plot
Aggregated Hourly distribution of bike rentals:
plt.figure(figsize = [8, 6])
# Assign palette and grid as per requirement
sb.set_style('darkgrid')
# prepare data for the plot
x = bikeshare.groupby(bikeshare['hour']).count()['trip_id'].index
y = bikeshare.groupby(bikeshare['hour']).count()['trip_id'].values
x_tick_values = np.arange(0, 23+1, 1)
x_tick_names = ['{:}'.format(v) for v in x_tick_values]
y_tick_values = np.arange(0, bikeshare.hour.value_counts().max()+10000, 10000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
# matplotlib's line plot
plt.plot(x, y, linewidth=2.0, color = 'lightskyblue')
# improve plot aesthetics
plt.title('Aggregated Hourly distribution of bike rentals', fontsize = 16, weight = 'bold')
plt.xlabel('\nHour of the day', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# fill the area under the line
plt.fill_between(x, y, color = 'lightskyblue')
# draw the vertical axial line at the peak hour
peak_hour = bikeshare['hour'].value_counts(ascending=False).index[0]
plt.axvline(peak_hour, color='black', alpha=0.3, linewidth=2);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.1 Aggregated distribution of bike rentals based on hour of the day.png', dpi=300, bbox_inches='tight')
The above plot shows that the busiest hours are in the evening. The vertical line marks the hour with the maximum aggregated bike rentals, which is 5:00 PM. Let us look at the average number of bike rentals for each hour of the day for a clearer interpretation of the trends.
Average bike rentals based on the hour of the day:
# create a dataset for bike rentals over each hour in a day
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"]]).count()['trip_id'].reset_index(name='rentals')
hours_df['rentals'] = hours_df['rentals'].fillna(0).astype(int)
hours_df.head(10)
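A note on the fillna(0) step above: as a rough sketch, assuming the grouping keys are plain (non-categorical) columns, groupby only returns the combinations that actually occur, so there may be no NULL values left to fill. The toy data and the names trips, observed, full_index and zero_filled below are hypothetical and are not part of the bikeshare dataset. They only illustrate one possible way to add hours without any rental as explicit zeros before averaging.

```python
import pandas as pd

# hypothetical toy trips: hour 17 has a rental on day 1 but none on day 2
trips = pd.DataFrame({
    "day":     [1, 1, 1, 2],
    "hour":    [8, 8, 17, 8],
    "trip_id": [1, 2, 3, 4],
})

# rentals per (day, hour); only combinations present in the data appear
observed = trips.groupby(["day", "hour"]).size()

# build every (day, hour) combination and fill the missing ones with zero
full_index = pd.MultiIndex.from_product(
    [sorted(trips["day"].unique()), sorted(trips["hour"].unique())],
    names=["day", "hour"],
)
zero_filled = observed.reindex(full_index, fill_value=0)

# average rentals per hour with and without the explicit zeros
avg_observed = observed.groupby(level="hour").mean()
avg_zero_filled = zero_filled.groupby(level="hour").mean()

print(avg_observed)
print(avg_zero_filled)
```

In this toy case the two averages differ only for hour 17, because that is the only hour with a missing day. Whether zero-filling is appropriate depends on whether an absent combination really means "no rentals" or "not applicable".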
Point plot:
# Assign color palette and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('white')
# Seaborn's point plot
sb.pointplot(data = hours_df, x = "hour", y = "rentals", linestyles = "-", color = 'lightskyblue')
# improve plot aesthetics
plt.title('Average bike rentals based on hour of the day\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. rentals\n', fontsize = 14)
plt.xlabel('\nHour of the day', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+10, 10)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
avg_rental_counts = hours_df.groupby([hours_df["hour"]]).mean()['rentals']
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*3)/5)) else 'limegreen' for count in avg_rental_counts ]
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    count = avg_rental_count
    pct_string = '{:0.0f}'.format(count)
    # print the annotation slightly above the data point
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black',
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.2 Average bike rentals based on hour of the day.png', dpi=300, bbox_inches='tight')
Observations:
- The bike rentals aggregated by the hour of the day show that rentals gradually increase from 6:00 AM until 5:00 PM, with peaks at 8:00 AM, 12:00 PM, and 5:00 PM, which correspond to morning office hours, afternoon lunch time, and evening office closing time respectively. This suggests that a large portion of the customer base consists of working individuals who use the bikes for transportation.
- The average bike rentals by hour of the day show that rentals are lowest during the night and early hours.
Average rentals based on the weekday over individual years:
# create a dataset for bike rentals over the days in a week
weekday_df = bikeshare.groupby([bikeshare['year'],
bikeshare['month'],
bikeshare['week'],
bikeshare['weekday']]).count()['trip_id'].reset_index(name='rentals')
weekday_df.head(10)
# Assign color palette and figure size as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Seaborn's point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-",
hue = 'year', ci = None, order = plot_order)
# improve plot aesthetics
plt.title('Average bike rentals based on weekday of the week\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.8), loc = 6, labelspacing=0.5,
title='Year', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True,
handlelength=2, handletextpad=0.5)
# draw horizontal reference lines
plt.axhline(500, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(900, color='grey', alpha=1, linewidth=0.5, linestyle='--')
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.3 Average bike rentals based on day of the week over years.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that bike rentals decrease on non-working days such as Saturday and Sunday. This reinforces the argument that the majority of the customer base consists of working individuals.
- However, the year 2019 experiences a steep decrease in bike rentals on non-working days compared to previous years. This suggests a failure to attract tourists and non-subscribers to ride a bike over weekends.
Average bike rentals based on hour of the day over years by trip type and pass type:
# create a dataset for bike rentals over each hour in a day by trip type and pass type
hours_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"],
bikeshare["hour"],
bikeshare["trip_type"],
bikeshare["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
hours_df.head(10)
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
# Facet grid with point plot
plot_order = hours_df.hour.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = hours_df, col = 'pass_type', row = 'year', height = 4, aspect = 1, hue = 'trip_type')
g.map(sb.pointplot, "hour", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.85)
g.fig.suptitle('Average bike rentals based on hour of the day over years by trip type and pass type',
fontsize = 16, weight = 'bold')
g.set_titles('Year = {row_name} | Pass = {col_name}', weight = 'bold', size = 14, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nHour of the day', size = 14)
g.set_ylabels('Avg. bike rentals\n', size = 14)
g.set_yticklabels(size = 12)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    for i,l in enumerate(labels):
        # keep every fifth label and blank out the rest
        if not (i%5 == 0): labels[i] = ''
    # set new labels
    ax.set_xticklabels(labels, size = 12)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2)]
plt.legend(custom, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 3, title='Trip Type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-1.20, 4.1));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.6.4 Average bike rentals based on hour of the day over years by trip type and pass type.png', dpi=300, bbox_inches='tight')
Observation:
The above plot shows that the average number of bike rentals taken by non-subscribers and tourists (Walk-up pass or One Day pass) is lower than the number taken by working individuals (with a Monthly pass). This reinforces the argument that the majority of the customer base is comprised of working individuals.
Column: hour
Data type: continuous data
Plot: Distribution plot, Line plot
Average bike rentals based on the time of the day:
# create a dataset for bike rentals for each daytime of the day
daytime_df = bikeshare.groupby([bikeshare['year'],
bikeshare['month'],
bikeshare['day'],
bikeshare['daytime']]).count()['trip_id'].reset_index(name='rentals')
daytime_df['rentals'] = daytime_df['rentals'].fillna(0).astype(int)
daytime_df.head(10)
Point plot:
# Assign color palette and grid as per requirement
sb.set_style('white')
# Seaborn's point plot
sb.pointplot(data = daytime_df, x = "daytime", y = "rentals", linestyles = "-", color = 'lightskyblue')
# improve plot aesthetics
plt.title('Avg. bike rentals based on daytime\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nSection of the day', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+50, 50)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
cat_order = daytime_df.daytime.sort_values(ascending=True).unique()
avg_rental_counts = daytime_df.groupby([daytime_df["daytime"]]).mean()['rentals'][cat_order]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*4)/5)) else 'limegreen' for count in avg_rental_counts ]
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    count = avg_rental_count
    pct_string = '{:0.0f}'.format(count)
    # print the annotation slightly above the data point
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black',
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.7.1 Average bike rentals based on time of the day.png', dpi=300, bbox_inches='tight')
Afternoon, with Morning and Evening being closest. This denotes that the customers use bike rentals the most during daytime. Subsequently, the rental activity is least at Early Hours and Night times.
Early Hours. Night-time rentals.
Column: day
Data type: continuous data
Plot: Distribution plot, Line plot
Aggregated bike rentals based on the day of the month:
# Assign figure size and color palette as per requirement
plt.figure(figsize = [18, 6])
sb.set_style('darkgrid')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
clr = sb.color_palette()[4]
# prepare data for the plot
day_index_max = bikeshare.day.sort_values(ascending=False).unique()[0]
daily_order = np.arange(1, day_index_max+1, 1)
max_count = bikeshare.day.value_counts().max()
min_count = bikeshare.day.value_counts().min()
tick_values = np.arange(0, max_count+10000, 10000)
tick_names = ['{:0.0f} K'.format(v/1000) for v in tick_values]
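# align the counts with the plotting order so that each color maps to the right bar
day_values = bikeshare.day.value_counts().reindex(daily_order).values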
clrs = ['thistle' if (x > min_count) else clr for x in day_values]
# Seaborn's count plot
sb.countplot(data = bikeshare, x = 'day', palette=clrs,
alpha= 1, order = daily_order, saturation = 0.8)
# improve plot aesthetics
plt.title('Aggregative distribution of bike rentals based on day of the month', fontsize = 16, weight = 'bold')
plt.xlabel('\nDay of the month', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(tick_values, tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
daily_counts = bikeshare.day.value_counts()
daily_max = daily_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    try:
        count = daily_counts[int(label.get_text())]
    except KeyError:
        count = 0
    pct_string = '{:0.1f}%'.format(100*count/n_points)
    # print the annotation just above the bar
    plt.text(loc, count + (daily_max/40), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.1 Aggregated distribution of bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')
The above plot shows that rentals decrease towards the end of the month, especially on the 31st. However, the rentals are grouped by day of the month, and the distribution is the cumulative sum for each day over 3 years rather than per individual month. Hence, there are only 21 occurrences of day 31, while most other days occur 36 times over the 3-year period (2017-2019); days 29 and 30 occur 33 times each because they are absent in February. This means that the rate of rentals is not actually low on the 31st compared to other days. Let us perform a more detailed analysis by calculating the average bike rentals based on the day of the month.
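The occurrence counts quoted above (36, 33 and 21) can be checked with a short calendar calculation. This is a minimal sketch using only the standard library; the helper name day_of_month_occurrences is hypothetical and is not part of the notebook.

```python
import calendar
from collections import Counter

def day_of_month_occurrences(first_year, last_year):
    """Count how many months in the period contain each day number (1..31)."""
    counts = Counter()
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            # monthrange returns (weekday of the first day, number of days in the month)
            _, days_in_month = calendar.monthrange(year, month)
            counts.update(range(1, days_in_month + 1))
    return counts

occurrences = day_of_month_occurrences(2017, 2019)
print(occurrences[28], occurrences[29], occurrences[30], occurrences[31])
# prints: 36 33 33 21
```

None of the years 2017 to 2019 is a leap year, which is why day 29 is missing from all three Februaries.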
Average rentals based on the day of the month:
Create a dataset that contains the bike rentals for each day of the month over the respective years. Care should be taken not to include day 31 for every month of the year. Use only the combinations that actually appear in the data and do not fill the NULL values with zeros, so that the rentals of day 31 are considered only for the months that have it.
# create a dataset for bike rentals over the days of the month
days_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["day"]]).size().reset_index(name='rentals')
days_df.tail(10)
Check the appearances of individual days over the dataset created:
cat_order = days_df.day.sort_values(ascending=True).unique()
print('Day - Occurrences')
days_df.day.value_counts()[cat_order]
The above cell shows that days 29, 30, and 31 have fewer appearances than the other days of the month. This confirms that the dataset is reliable for calculating the average bike rentals based on the day of the month.
Point plot:
# Assign grid and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('whitegrid')
# Seaborn's point plot
sb.pointplot(data = days_df, x = "day", y = "rentals", linestyles = "-", color = 'lightskyblue', ci=None)
# improve plot aesthetics
plt.title('Avg. bike rentals based on day of the month\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. rentals\n', fontsize = 14)
plt.xlabel('\nDay of the month', fontsize = 14)
sb.despine(top=True, right=True, left=True, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.2 Average bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')
Contrary to the previous plot, the above plot shows that the days at the end of the month have relatively high average bike rentals compared to most other days of the month. However, the y-axis of the above plot does not start at zero, which amplifies the differences between the average rentals of the individual days. Re-plot the graph with the y-axis starting at zero.
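To see why a y-axis that does not start at zero exaggerates differences, compare how tall two points appear above the axis baseline. This is only an illustrative sketch: the values 700 and 800 are the approximate range of the averages, while the baseline of 650 and the helper name apparent_ratio are hypothetical.

```python
def apparent_ratio(low, high, axis_start):
    """Ratio of the plotted heights of two values measured from the axis baseline."""
    if axis_start >= low:
        raise ValueError("axis_start must be below both values")
    return (high - axis_start) / (low - axis_start)

# with the axis starting at zero the heights reflect the real ratio (800 / 700)
true_ratio = apparent_ratio(700, 800, axis_start=0)

# with a hypothetical truncated axis starting at 650, the same two values
# are drawn 50 and 150 units above the baseline
truncated_ratio = apparent_ratio(700, 800, axis_start=650)

print(round(true_ratio, 2), truncated_ratio)
# prints: 1.14 3.0
```

A difference of about 14% looks like a threefold difference on the truncated axis, which is the effect the re-plot is meant to remove.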
Average rentals based on the day of the month over Zero:
# Assign grid and figure size as per requirement
plt.figure(figsize=[12,4])
sb.set_style('white')
# Seaborn's point plot
sb.pointplot(data = days_df, x = "day", y = "rentals", linestyles = "-", color = 'lightskyblue', ci=None)
# improve plot aesthetics
plt.title('Avg. bike rentals based on day of the month\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the month', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# draw horizontal reference lines
plt.axhline(700, color='black', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='black', alpha=1, linewidth=0.5, linestyle='--')
sb.despine(top=True, right=True, left=True, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.8.3 Average bike rentals based on day of the month.png', dpi=300, bbox_inches='tight')
700 and 800 only. This shows that there is no significant difference in the average bike rentals between any two days of the month.
Column: day of the week
Data type: continuous data
Plot: Distribution plot, Line plot
Aggregated bike rentals based on the day of the week:
# Assign figure size and color palette as per requirement
plt.figure(figsize = [8, 6])
sb.set_style('white')
# prepare data for the plot
day_order = bikeshare.weekday.value_counts().index
max_count = bikeshare.weekday.value_counts().max()
min_count = bikeshare.weekday.value_counts().min()
mean_count = bikeshare.weekday.value_counts().mean()
y_tick_values = np.arange(0, max_count+25000, 25000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
weekday_values = bikeshare.weekday.value_counts().values
clrs = ['#aeebe5' if (x > mean_count) else 'cyan' for x in weekday_values ]
# Seaborn's count plot
sb.countplot(data = bikeshare, x = 'weekday', palette=clrs,
alpha= 0.5, order = day_order, saturation = 0.5)
# improve plot aesthetics
plt.title('Aggregated distribution of bike rentals over the weekday\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nDay of the week', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
n_points = bikeshare.shape[0]
day_counts = bikeshare.weekday.value_counts(ascending=False).values
day_max = day_counts.max()
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label in zip(locs, labels):
    # look up the count by bar position (day_counts follows the plotting order)
    try:
        count = day_counts[loc]
        pct_string = '{:0.1f}%'.format(100*count/n_points)
    except (KeyError, IndexError):
        count = 15000
        pct_string = '0%'
    # print the annotation depending on the bar length
    if count < (day_max/10):
        plt.text(loc, count+(day_max/25), pct_string, ha = 'center', color = 'black', fontsize = 12)
    else:
        plt.text(loc, count-(day_max/15), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.1 Aggregated distribution of bike rentals over the week.png', dpi=300, bbox_inches='tight')
The aggregated distribution of bike rentals over the week shows that the weekend days, Saturday and Sunday, have relatively low bike rentals (less than the mean of the total rentals per weekday) compared to the other days of the week. This is likely because the majority of the customer base consists of working employees who use the bikes to commute to work.
However, the number of occurrences of each weekday affects the aggregated rentals (not all weekdays occur the same number of times in a month), so calculate the average bike rentals per weekday for a clearer analysis.
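The uneven number of weekday occurrences can be verified directly from the calendar. The sketch below uses only the standard library; the helper weekday_occurrences and the choice of March 2019 as the sample month are illustrative assumptions.

```python
from collections import Counter
from datetime import date, timedelta

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def weekday_occurrences(start, end):
    """Count how often each weekday occurs between start and end (inclusive)."""
    counts = Counter()
    current = start
    while current <= end:
        # date.weekday() returns 0 for Monday through 6 for Sunday
        counts[WEEKDAY_NAMES[current.weekday()]] += 1
        current += timedelta(days=1)
    return counts

# a single sample month: some weekdays occur 5 times, the others 4 times
one_month = weekday_occurrences(date(2019, 3, 1), date(2019, 3, 31))
# the whole period covered by the dataset (2017-2019)
period = weekday_occurrences(date(2017, 1, 1), date(2019, 12, 31))

print(one_month)
print(period)
```

Within a single month the counts differ by up to one occurrence per weekday, while over the full three years the difference shrinks to 157 versus 156 occurrences.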
Average bike rentals based on the weekday:
Create a dataset that contains the bike rentals for each day of the week over the respective months of each year. Care should be taken to include all days in every week of the month. Use all categorical combinations and fill the NULL values with zeros so that every weekday of every week is considered.
# create a dataset for bike rentals over the days in a week
weekday_df = bikeshare.groupby([bikeshare['year'],
bikeshare['month'],
bikeshare['week'],
bikeshare['weekday']]).count()['trip_id'].reset_index(name='rentals')
weekday_df['rentals'] = weekday_df['rentals'].fillna(0).astype(int)
weekday_df.head(10)
Point plot:
# Assign palette and figure size as per requirement
plt.figure(figsize=[8,4])
sb.set_style('white')
flatui = ['cyan']
sb.set_palette(flatui, n_colors=1, desat=0.5)
base_color = sb.color_palette()[0]
# Seaborn's point plot
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", color = base_color)
# improve plot aesthetics
plt.title('Avg. bike rentals based on weekday\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nWeekday', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
cat_order = weekday_df.weekday.sort_values(ascending=True).unique()
avg_rental_counts = weekday_df.groupby([weekday_df["weekday"]]).mean()['rentals'][cat_order]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (count > ((avg_rental_max*9)/10)) else '#d479a3' for count in avg_rental_counts ]
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, label, avg_rental_count, clr in zip(locs, labels, avg_rental_counts, clrs):
    count = avg_rental_count
    pct_string = '{:0.0f}'.format(count)
    # print the annotation slightly above the data point
    plt.text(loc, count + int(avg_rental_max/10), pct_string, ha = 'center', color = 'black',
             fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
# draw horizontal reference lines
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')
sb.despine(top=True, right=True, left=True, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.2 Average bike rentals based on day of the week.png', dpi=300, bbox_inches='tight')
The above plot shows the distribution of average bike rentals over the days of the week, which mostly ranges between 600 and 700. The yellow annotations mark the busy days of the week. There is a slight decrease in average bike rentals towards the weekend (Saturday, Sunday), while Friday appears to be the busiest day of the week.
However, the average is calculated over all rentals across the 3 years. Perform an analysis of the individual years for clearer insights.
Average rentals based on the weekday over individual years:
# Assign color palette and figure size as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#6b8a99', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Seaborn's point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
ax = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-",
hue = 'year', ci = None, order = plot_order)
# improve plot aesthetics
plt.title('Average bike rentals based on weekday over years\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrage them with respect to zero
locs, labels = plt.yticks()
max_count = locs.max()
y_tick_values = np.arange(0, max_count+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.8), loc = 6, labelspacing=0.5,
title='Year', title_fontsize=12, fontsize=10, facecolor='white', markerfirst=True,
handlelength=2, handletextpad=0.5)
# Add an inverted triangle marker at the desired data points
ax.lines[0].set_alpha(0.8)
ax.lines[1].set_alpha(0.8)
ax.lines[2].set_markevery(every=[5,6])
ax.lines[2].set_marker('v')
ax.lines[2].set_markersize(12)
ax.lines[2].set_markeredgewidth(3)
ax.lines[2].set_markerfacecolor('lightskyblue')
ax.lines[2].set_markeredgecolor('indianred')
# draw horizontal reference lines
plt.axhline(500, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(600, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(700, color='grey', alpha=1, linewidth=0.5, linestyle='--')
plt.axhline(800, color='grey', alpha=1, linewidth=0.5, linestyle='--')
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.3 Average bike rentals based on day of the week over years.png', dpi=300, bbox_inches='tight')
Observation:
The above plot depicts that the years 2017 and 2018 show a relatively slight decrease in average bike rentals over the weekend compared to the other days of the week; however, the year 2019 experiences a sudden drop in average bike rentals during the weekend (Saturday and Sunday). This is not a good sign for a healthy business model and requires reforms.
Reform:
Organizing and promoting fitness or recreational activities such as bike rallies will potentially increase the bike rentals on weekends and holidays significantly.
Let us take a look at the other factors that influence the bike rentals over weekday:
Average bike rentals based on the weekday and trip type:¶A month does not divide evenly into 7-day weeks, so a partial week in a month does not contain every weekday. Hence, to calculate the average rentals per weekday accurately, group the data by year, month, week and weekday and use the size() method, which counts the rows of each combination that actually occurs. Combinations that never occur do not appear in the result, and rows with a NULL value in a grouping column are ignored.
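The two-step averaging described above can be sketched on a toy table. This is only a minimal illustration: the `toy` dataframe and its values are hypothetical, and it merely assumes the same `year`, `month`, `week` and `weekday` column names as `bikeshare`.

```python
import pandas as pd

# Hypothetical toy data: one row per rental (values invented for illustration)
toy = pd.DataFrame({
    "year":    [2019, 2019, 2019, 2019, 2019],
    "month":   [1, 1, 1, 1, 1],
    "week":    [1, 1, 1, 2, 2],
    "weekday": ["Mon", "Mon", "Tue", "Mon", "Mon"],
})

# Step 1: count the rentals of every (year, month, week, weekday) combination that occurs
per_day = (toy.groupby(["year", "month", "week", "weekday"])
              .size()
              .reset_index(name="rentals"))

# Step 2: average those counts per weekday
avg_per_weekday = per_day.groupby("weekday")["rentals"].mean()
print(avg_per_weekday)
```

Note that week 2 has no Tuesday row in the toy data, so it does not enter the Tuesday average as a zero; only occurrences that exist are averaged.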
# create a dataset for bike rentals over each weekday in a week categorized by trip type
weekday_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["week"],
bikeshare["weekday"],
bikeshare["trip_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Point plot:
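def assign_clrs(counts):
    # green where the count rose compared with the previous entry, salmon otherwise
    clr_list = []
    for i in range(len(counts)):
        try:
            if counts[i] > counts[i-1]:
                clr_list.append('mediumseagreen')
            else:
                clr_list.append('salmon')
        except KeyError:
            clr_list.append('mediumseagreen')
    return clr_list
def assign_df(dataframe, column):
    # split the rows into days that rose / fell compared with the previous day
    index_list1 = []
    index_list2 = []
    df = dataframe.reset_index()
    for i in range(df.shape[0]):
        # for i == 0, iloc[-1] is the last row, so the first day is compared with the last one
        if df.iloc[i].rentals > df.iloc[i-1].rentals:
            index_list1.append(i)
        else:
            index_list2.append(i)
    # copy the slices so that the dtype assignment below does not act on a view
    inc_df = df.loc[index_list1,:].copy()
    dec_df = df.loc[index_list2,:].copy()
    # use the dataframe passed in (not the global loop variable) to derive the category order
    level_order = df[column].sort_values(ascending=True).unique()
    ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
    inc_df[column] = inc_df[column].astype(ordered_cat)
    dec_df[column] = dec_df[column].astype(ordered_cat)
    return inc_df, dec_df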
# Assign figure size and color palette
plt.figure(figsize=[8, 5])
sb.set_style('white')
flatui = ['#e36297', '#42d4be']
sb.set_palette(flatui, n_colors=2, desat=0.8)
base_color = sb.color_palette()[0]
# Seaborn's pointplot
ax1 = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = ['-', '-'],
hue = 'trip_type', ci = None, markers=["", ""])
# add annotations
# -------------------------------------------------------
# select the rentals column before averaging so that non-numeric columns are never aggregated
avg_rentals = weekday_df.groupby(["trip_type", "weekday"])['rentals'].mean().reset_index()
avg_rentals_base = avg_rentals.query(' trip_type == "One Way" ')
avg_rentals_extended = avg_rentals.query(' trip_type == "Round Trip" ')
# get the current tick locations and labels
locs, labels = plt.xticks()
for categorical_df in [avg_rentals_base, avg_rentals_extended]:
    avg_rental_counts = categorical_df['rentals']
    avg_rental_max = avg_rental_counts.max()
    # highlight (gold) the days whose average exceeds 90% of the maximum
    clrs = ['gold' if (count > ((avg_rental_max*9)/10)) else 'grey' for count in avg_rental_counts]
    inc_df, dec_df = assign_df(categorical_df, 'weekday')
    sb.pointplot(data = inc_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None,
                 color = 'green', markers = ["^"], ax = ax1)
    sb.pointplot(data = dec_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None,
                 color = 'red', markers = ["v"], ax = ax1)
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        count = avg_rental_count
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # print the annotation slightly above the data point
        plt.text(loc, count+indent, pct_string, ha = 'center', color = 'black',
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom_lines = [Line2D([0], [0], color=sb.color_palette()[0], lw=2),
Line2D([0], [0], color=sb.color_palette()[1], lw=2)]
plt.legend(custom_lines, ['One Way', 'Round Trip'], scatterpoints=1, frameon=True, fancybox=True, shadow=False,
ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));
# -------------------------------------------------------
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average weekday bike rentals categorized by trip type\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# -------------------------------------------------------
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.3)
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.4 Average weekday bike rentals categorized by trip type.png', dpi=300, bbox_inches='tight')
In the above plot, the yellow annotations depict the busy days of the week, while the markers depict whether the respective day's bike rentals increased or decreased in comparison with the previous day.
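The increase/decrease markers can also be derived without an explicit loop. The following is a minimal sketch with hypothetical values (the `avg` dataframe is invented for illustration); it wraps around so that the first day is compared with the last one, as `assign_df` does through `iloc[i-1]`.

```python
import numpy as np
import pandas as pd

# Hypothetical average rentals per weekday (values invented for illustration)
avg = pd.DataFrame({
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    "rentals": [600, 620, 610, 640, 660, 580, 590],
})

# Previous day's value; np.roll wraps around, so Monday is compared with Sunday
rentals = avg["rentals"].to_numpy()
previous = np.roll(rentals, 1)

# "up" where the day's rentals exceed the previous day's, "down" otherwise
avg["direction"] = np.where(rentals > previous, "up", "down")
print(avg)
```

Rows marked "up" would correspond to the green upward triangles and rows marked "down" to the red downward triangles.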
Observation:
The above plot depicts that even though the number of customers that take One Way trips (probably working individuals who ride to work) decreases over weekends, the number of customers that take Round Trips increases during the weekends. This is clearly evident in the Round Trip line of the above plot, where the yellow annotations depict the busy days of the week, while the markers denote the respective day's rentals in comparison with the previous day.
This behaviour is strongly reinforced by the plot that depicts the average bike rentals categorized by fare types over the weekday.
Average bike rentals based on the weekday and fare type:¶# create a dataset for bike rentals over each weekday in a week categorized by fare type
weekday_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["week"],
bikeshare["weekday"],
bikeshare["fare_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Point plot:
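def assign_clrs(counts):
    # green where the count rose compared with the previous entry, salmon otherwise
    clr_list = []
    for i in range(len(counts)):
        try:
            if counts[i] > counts[i-1]:
                clr_list.append('mediumseagreen')
            else:
                clr_list.append('salmon')
        except KeyError:
            clr_list.append('mediumseagreen')
    return clr_list
def assign_df(dataframe, column):
    # split the rows into days that rose / fell compared with the previous day
    index_list1 = []
    index_list2 = []
    df = dataframe.reset_index()
    for i in range(df.shape[0]):
        # for i == 0, iloc[-1] is the last row, so the first day is compared with the last one
        if df.iloc[i].rentals > df.iloc[i-1].rentals:
            index_list1.append(i)
        else:
            index_list2.append(i)
    # copy the slices so that the dtype assignment below does not act on a view
    inc_df = df.loc[index_list1,:].copy()
    dec_df = df.loc[index_list2,:].copy()
    # use the dataframe passed in (not the global loop variable) to derive the category order
    level_order = df[column].sort_values(ascending=True).unique()
    ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
    inc_df[column] = inc_df[column].astype(ordered_cat)
    dec_df[column] = dec_df[column].astype(ordered_cat)
    return inc_df, dec_df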
# Assign figure size and color palette
plt.figure(figsize=[8, 5])
sb.set_style('white')
flatui = ['#577da1', '#eda668']
sb.set_palette(flatui, n_colors=2, desat=0.6)
base_color = sb.color_palette()[0]
# Seaborn's point plot
ax1 = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = ["-", "-"], hue = 'fare_type',
scale = 1, ci = None, markers=["", ""])
# add annotations
# -------------------------------------------------------
# select the rentals column before averaging so that non-numeric columns are never aggregated
avg_rentals = weekday_df.groupby(["fare_type", "weekday"])['rentals'].mean().reset_index()
avg_rentals_base = avg_rentals.query(' fare_type == "Base" ')
avg_rentals_extended = avg_rentals.query(' fare_type == "Extended" ')
# get the current tick locations and labels
locs, labels = plt.xticks()
for categorical_df in [avg_rentals_base, avg_rentals_extended]:
    avg_rental_counts = categorical_df['rentals']
    avg_rental_max = avg_rental_counts.max()
    # highlight (gold) the days whose average exceeds 80% of the maximum
    clrs = ['gold' if (count > ((avg_rental_max*4)/5)) else 'grey' for count in avg_rental_counts]
    inc_df, dec_df = assign_df(categorical_df, 'weekday')
    sb.pointplot(data = inc_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None,
                 color = 'green', markers = ["^"], ax = ax1)
    sb.pointplot(data = dec_df, x = "weekday", y = "rentals", linestyles = "", scale = 1, ci = None,
                 color = 'red', markers = ["v"], ax = ax1)
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        count = avg_rental_count
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # print the annotation slightly above the data point
        plt.text(loc, count+indent, pct_string, ha = 'center', color = 'black',
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
# add custom legend
custom_lines = [Line2D([0], [0], color=sb.color_palette()[0], lw=2),
Line2D([0], [0], color=sb.color_palette()[1], lw=2)]
plt.legend(custom_lines, ['Base', 'Extended'], scatterpoints=1, frameon=True, fancybox=True, shadow=False,
ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));
# improve plot aesthetics
plt.title('Average weekday bike rentals categorized by fare type\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# plt.xlim(ax1.get_xlim())
plt.xticks(fontsize = 12);
# plot custom grid-lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.3)
sb.despine(top=True, right=True, left=False, bottom=False);
# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.5 Average weekday bike rentals categorized by fare type.png', dpi=300, bbox_inches='tight')
In the above plot, the yellow annotations depict the busy days of the week, while the markers depict whether the respective day's bike rentals increased or decreased in comparison with the previous day.
Observation:
The above plot depicts that customers tend to travel for longer durations (bike rentals with extended fares) during the weekends.
Average bike rentals based on the weekday and pass type:¶Let us observe the effect of the customer's pass type on the bike rentals over the weekend.
# create a dataset for bike rentals over each weekday in a week categorized by pass type
weekday_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["week"],
bikeshare["weekday"],
bikeshare["pass_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Point plot:
def assign_clr(pass_type):
    # map each pass type to its palette colour; fall back to gold for anything else
    if (pass_type == "Walk-up"): return sb.color_palette()[0]
    elif (pass_type == "One Day"): return sb.color_palette()[1]
    elif (pass_type == "Monthly"): return sb.color_palette()[2]
    elif (pass_type == "Flex"): return sb.color_palette()[3]
    elif (pass_type == "Annual"): return sb.color_palette()[4]
    return 'gold'
# Assign figure size and color palette
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ["#34e0c7", "#c271e3", "#4cb1f5", "#e06458", "#34495e"]
sb.set_palette(flatui, n_colors=5, desat=0.6)
base_color = sb.color_palette()[0]
# Seaborn's pointplot
ax = sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-",
hue = 'pass_type', scale = 1, ci = None)
# improve plot aesthetics
# -------------------------------------------------------
plt.title('Average weekday bike rentals categorized by pass type\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
sb.despine(top=True, right=True, left=False, bottom=False);
# -------------------------------------------------------
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1,
framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1, 1))
# plot custom grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5)
# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.6 Average weekday bike rentals categorized by pass type.png', dpi=300, bbox_inches='tight')
Observation:
The above plot depicts that the customers with Monthly and Annual pass types are less likely to ride a bike during the weekends. This may be because this customer base comprises working individuals. However, the number of customers that take Walk-up and One Day passes for a ride increases during the weekends. This suggests that the weekends attract customers other than working individuals, i.e. tourists and people who enjoy taking a ride for sightseeing.
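The weekend effect per pass type can be quantified as a relative change between the weekday and weekend averages. This is a minimal sketch on hypothetical numbers: the `avg` dataframe is invented, and the weekday labels "Sat" and "Sun" are an assumption about how the `weekday` column is encoded.

```python
import pandas as pd

# Hypothetical average rentals per weekday and pass type (values invented)
avg = pd.DataFrame({
    "pass_type": ["Monthly"] * 7 + ["Walk-up"] * 7,
    "weekday":   ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"] * 2,
    "rentals":   [400, 420, 410, 430, 420, 300, 280,
                  150, 140, 150, 160, 200, 260, 240],
})

# Flag the weekend days (label names are an assumption)
avg["is_weekend"] = avg["weekday"].isin(["Sat", "Sun"])

# Mean rentals on working days vs. weekend days, one row per pass type
summary = (avg.groupby(["pass_type", "is_weekend"])["rentals"]
              .mean()
              .unstack("is_weekend")
              .rename(columns={False: "weekday_avg", True: "weekend_avg"}))

# Relative change of the weekend average with respect to the weekday average
summary["weekend_change_pct"] = (summary["weekend_avg"] / summary["weekday_avg"] - 1) * 100
print(summary.round(1))
```

A negative `weekend_change_pct` would correspond to the pattern described for Monthly and Annual passes, a positive one to the pattern described for Walk-up and One Day passes.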
Average bike rentals based on the weekday and bike type:¶Let us observe the effect of the customer's bike preference on the bike rentals over the weekend.
# create a dataset for bike rentals over each weekday in a week categorized by bike type
weekday_df = bikeshare.groupby([bikeshare["year"],
bikeshare["month"],
bikeshare["week"],
bikeshare["weekday"],
bikeshare["bike_type"]]).size().reset_index(name='rentals')
weekday_df.head(10)
Point plot:
def assign_clr(bike):
    # map each bike type to its palette colour; fall back to gold for anything else
    if (bike == "unknown"): return sb.color_palette()[0]
    elif (bike == "Standard"): return sb.color_palette()[1]
    elif (bike == "Electric"): return sb.color_palette()[2]
    elif (bike == "Smart"): return sb.color_palette()[3]
    return 'gold'
# Assign palette as per requirement
plt.figure(figsize=[8,5])
sb.set_style('white')
flatui = ['#ff7ddd', '#77f7cc', '#4b99eb', '#aa75fa']
sb.set_palette(flatui, n_colors=4, desat=0.6)
base_color = sb.color_palette()[0]
# Seaborn's point plot
sb.pointplot(data = weekday_df, x = "weekday", y = "rentals", linestyles = "-", hue = 'bike_type',
scale = 1, ci = None)
# improve plot aesthetics
plt.title('Average weekday bike rentals categorized by bike type\n', weight = 'bold', fontsize = 16)
plt.ylabel('Avg. bike rentals\n', fontsize = 14)
plt.xlabel('\nDay of the week', fontsize = 14)
plt.xticks(fontsize = 12)
# get ytick locs and rearrange them with respect to zero
locs, labels = plt.yticks()
weekday_rental_avg_max = 800
y_tick_values = np.arange(0, weekday_rental_avg_max+100, 100)
y_tick_names = ['{:0.0f}'.format(v) for v in y_tick_values]
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
# add annotations
# -------------------------------------------------------
# select the rentals column before averaging so that non-numeric columns are never aggregated
avg_rentals = weekday_df.groupby(["bike_type", "weekday"])['rentals'].mean().reset_index()
avg_rentals_max = avg_rentals.rentals.max()
avg_rentals_unknown = avg_rentals.query(' bike_type == "unknown" ')
avg_rentals_standard = avg_rentals.query(' bike_type == "Standard" ')
avg_rentals_electric = avg_rentals.query(' bike_type == "Electric" ')
avg_rentals_smart = avg_rentals.query(' bike_type == "Smart" ')
# get the current tick locations and labels
locs, labels = plt.xticks()
for categorical_df in [avg_rentals_unknown, avg_rentals_standard, avg_rentals_electric, avg_rentals_smart]:
    clrs = [assign_clr(bike) for bike in categorical_df.bike_type]
    # loop through each pair of locations and labels
    for loc, label, avg_rental_count, clr in zip(locs, labels, categorical_df.rentals, clrs):
        count = avg_rental_count
        pct_string = '{:0.0f}'.format(count)
        indent = 40
        # print the annotation slightly above the data point
        plt.text(loc, count + indent, pct_string, ha = 'center', color = 'black',
                 fontsize = 12, bbox=dict(pad=1.9,alpha=0.2,color='none',fc=clr))
# -------------------------------------------------------
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1,
framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.05, 1))
sb.despine(top=True, right=True, left=False, bottom=False);
# plot custom grid lines
for loc in y_tick_values:
    plt.axhline(loc, ls='--', color='grey', linewidth=0.5, alpha=0.5);
# savefig by passing (bbox_inches='tight'), which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.7 Average weekday bike rentals categorized by bike type.png', dpi=300, bbox_inches='tight')
Observation:
The above plot depicts that the bike rentals of Standard and Electric bikes are higher during weekdays. This is probably because the customer base consists largely of working individuals. However, the weekends attract customers that prefer Smart bikes.
Next, let us explore the combined effect of pass_type and bike_type on the bike rentals based on weekday over the individual years.
Average bike rentals based on the weekday over years by bike type and pass type:¶# create a dataset for bike rentals for each day in a week over the years by pass type and bike type
temp_df = bikeshare.query(' pass_type == "One Day" or pass_type == "Monthly" or pass_type == "Annual" ').copy()
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['bike_type'] = temp_df['bike_type'].astype(ordered_cat)
level_order = ['One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
temp_df['pass_type'] = temp_df['pass_type'].astype(ordered_cat)
weekday_df = temp_df.groupby([temp_df["year"],
temp_df["month"],
temp_df["week"],
temp_df["weekday"],
temp_df["bike_type"],
temp_df["pass_type"]]).count()['trip_id'].reset_index(name='rentals')
weekday_df.head(10)
Facet Grid:
# Assign palette as per requirement
sb.set_style('white')
flatui = ['#cc3f9d', '#31404a', '#60b6f0']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# Facet grid with point plot
plot_order = weekday_df.weekday.sort_values(ascending=True).unique()
g = sb.FacetGrid(data = weekday_df, col = 'pass_type', row = 'bike_type', margin_titles=True, height = 3, aspect = 1, hue = 'year')
g.map(sb.pointplot, "weekday", "rentals", order= plot_order, linestyles = "-", ci = None, markers = ['.']);
g.fig.subplots_adjust(top=0.8)
g.fig.suptitle('Average bike rentals based on day of the week over years by bike type and pass type',
fontsize = 16, weight = 'bold')
g.set_titles('Bike = {row_name} | Pass = {col_name}', weight = 'bold', size = 12, color = 'dimgrey')
# improve plot aesthetics
# -------------------------------------------------------
g.set_xlabels('\nDay of the week', size = 12)
g.set_ylabels('Avg. bike rentals\n', size = 12)
g.set_yticklabels(size = 10)
# iterate over axes of FacetGrid
for ax in g.axes.flat:
    # get x labels
    labels = ax.get_xticklabels()
    # set new labels
    ax.set_xticklabels(labels, rotation = 30, size = 10)
# -------------------------------------------------------
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], marker='.', color=sb.color_palette()[0], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[1], linestyle='-', linewidth = 2),
Line2D([], [], marker='.', color=sb.color_palette()[2], linestyle='-', linewidth = 2)]
plt.legend(custom, ['2017', '2018', '2019'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 3, title='Year', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.1, 4.3));
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.05, hspace=0.3);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.9.8 Average bike rentals based on weekday over the years by bike type and pass type.png', dpi=300, bbox_inches='tight')
- Monthly pass type and Standard bike type experience a decrement towards weekends.
- Monthly pass type and Electric bike type experience a decrement towards weekends.
- One Day pass type and Standard bike type experience an increment towards weekends.
- One Day pass type and Electric bike type experience an increment towards weekends.
- Smart bike type, irrespective of pass type, experiences a slight increment towards weekends.
Observations:
- This reflects that pass_type holds a stronger influence on the bike rentals over the week than bike_type.
- This reinforces the argument that the Monthly pass is preferred by working individuals, who favour Standard and Electric bikes, and that it experiences a decrement in average bike rentals over weekends (non-working days).
- The One Day pass attracts new or temporary customers such as tourists or explorers, who favour Standard and Smart bikes, and experiences an increment in average bike rentals over weekends and holidays.
- The Smart bike experiences a slight increase in average bike rentals over weekends.
- The year 2019 experienced a sudden drop in average bike rentals during weekends (Saturday and Sunday). This is not a good sign for a healthy business model and requires reforms.
- Customers with Annual and Monthly passes prefer Standard and Electric bikes to travel during working days and are less likely to travel during weekends.
- As the customer base contains a majority of working individuals, they tend to prefer One Way trips, which decrease during weekends.
- Even though One Way trips (probably working individuals who ride to work) decrease over weekends, the customers that take Round Trips increase during the weekends.
- pass_type holds a stronger influence on the bike rentals over the week than bike_type.
- Customers with a One Day pass prefer Standard and Smart bikes. Hence Smart bikes experience their highest bike rentals during the weekends. This category of customers also tends to take Round Trips and ride for longer durations, resulting in Extended fares and thus generating more income for the company.
- Organizing and promoting fitness or recreational activities such as bike rallies will potentially increase the bike rentals on weekends and holidays significantly.
- Rentals by customers with a One Day pass who prefer Standard bikes reduced significantly during 2019. Hence attracting this category of customers to use Standard bikes will enhance the business model significantly.
Column: start_lat, start_lon, end_lat, end_lon
Data type: numerical, continuous
Plot: Heat map
Exploration of geographical distribution of bike rentals based on start station's co-ordinates:¶# Assign figure size as per requirement
plt.figure(figsize = [8, 4])
h2d = plt.hist2d(data = bikeshare, x = 'start_lat', y = 'start_lon', cmin = 0.5, cmap = 'viridis_r')
# improve plot aesthetics
plt.title('Geographical distribution of Start stations\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nLatitude', fontsize = 14)
plt.ylabel('Longitude\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
# add annotations
# -------------------------------------------------------
# getting individual elements
counts = h2d[0]
x_bins = h2d[1]
y_bins = h2d[2]
counts_list = []
x_bin_diff_list = []
y_bin_diff_list = []
# collect the distinct non-NaN cell counts
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        # eliminate nan and append only if c does not exist in counts_list
        if c not in counts_list and not np.isnan(c):
            counts_list.append(c)
# collect the distinct bin widths along each axis
for k in range(len(x_bins)-1):
    x_bin_diff = x_bins[k+1] - x_bins[k]
    if x_bin_diff not in x_bin_diff_list:
        x_bin_diff_list.append(x_bin_diff)
for k in range(len(y_bins)-1):
    y_bin_diff = y_bins[k+1] - y_bins[k]
    if y_bin_diff not in y_bin_diff_list:
        y_bin_diff_list.append(y_bin_diff)
counts_mean = np.mean(counts_list)
x_bin_size = max(x_bin_diff_list)
y_bin_size = max(y_bin_diff_list)
# write each non-empty cell's count at the centre of the cell
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= counts_mean: # increase visibility on darkest cells
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'white', fontsize = 9)
        elif c > 0:
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'black', fontsize = 9)
# -------------------------------------------------------
plt.colorbar();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.1 Geographical distribution of Start stations.png', dpi=300, bbox_inches='tight')
The above plot depicts that some clusters of start stations account for fewer than 100 bike rentals over the 3-year period. These stations are advised to be either relocated or shut down, as their maintenance cost is significantly higher than the income they generate.
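The low-count cells of such a heat map can also be listed programmatically. The sketch below uses NumPy's `histogram2d`, which returns counts and bin edges like the first three elements of `plt.hist2d`; the coordinates are randomly generated stand-ins, not the bikeshare data, and the threshold of 100 is taken from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical coordinates: one dense cluster plus a few isolated rentals
lat = np.concatenate([rng.normal(34.05, 0.005, 5000), rng.normal(34.00, 0.001, 20)])
lon = np.concatenate([rng.normal(-118.25, 0.005, 5000), rng.normal(-118.45, 0.001, 20)])

# 2D histogram: counts[i, j] is the number of rentals in latitude bin i and longitude bin j
counts, lat_edges, lon_edges = np.histogram2d(lat, lon, bins=10)

# Cells that contain at least one rental but fewer than the threshold
threshold = 100
low = (counts > 0) & (counts < threshold)

for i, j in zip(*np.nonzero(low)):
    print(f"lat [{lat_edges[i]:.3f}, {lat_edges[i+1]:.3f}) "
          f"lon [{lon_edges[j]:.3f}, {lon_edges[j+1]:.3f}): {int(counts[i, j])} rentals")
```

A cell only describes a geographic area, so any station-level decision would still need the per-station totals rather than the cell counts.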
Exploration of geographical distribution of bike rentals based on end station's co-ordinates:¶# Assign figure size as per requirement
plt.figure(figsize = [8, 4])
h2d = plt.hist2d(data = bikeshare, x = 'end_lat', y = 'end_lon', cmin = 0.5, cmap = 'viridis_r')
# improve plot aesthetics
plt.title('Geographical distribution of End stations\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nLatitude', fontsize = 14)
plt.ylabel('Longitude\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12)
# add annotations
# -------------------------------------------------------
# getting individual elements
counts = h2d[0]
x_bins = h2d[1]
y_bins = h2d[2]
counts_list = []
x_bin_diff_list = []
y_bin_diff_list = []
# collect the distinct non-NaN cell counts
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        # eliminate nan and append only if c does not exist in counts_list
        if c not in counts_list and not np.isnan(c):
            counts_list.append(c)
# collect the distinct bin widths along each axis
for k in range(len(x_bins)-1):
    x_bin_diff = x_bins[k+1] - x_bins[k]
    if x_bin_diff not in x_bin_diff_list:
        x_bin_diff_list.append(x_bin_diff)
for k in range(len(y_bins)-1):
    y_bin_diff = y_bins[k+1] - y_bins[k]
    if y_bin_diff not in y_bin_diff_list:
        y_bin_diff_list.append(y_bin_diff)
counts_mean = np.mean(counts_list)
x_bin_size = max(x_bin_diff_list)
y_bin_size = max(y_bin_diff_list)
# write each non-empty cell's count at the centre of the cell
for i in range(counts.shape[0]):
    for j in range(counts.shape[1]):
        c = counts[i,j]
        if c >= counts_mean: # increase visibility on darkest cells
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'white', fontsize = 9)
        elif c > 0:
            plt.text(x_bins[i] + (x_bin_size/2), y_bins[j] + (y_bin_size/2), int(c),
                     ha = 'center', va = 'center', color = 'black', fontsize = 9)
# -------------------------------------------------------
plt.colorbar();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.2 Geographical distribution of End stations.png', dpi=300, bbox_inches='tight')
The above plot depicts that some clusters of end stations account for fewer than 100 bike rentals over the 3-year period. These stations are advised to be either relocated or shut down, as their maintenance cost is significantly higher than the income they generate.
Identification of stations which are financially liable for high maintenance:¶A station can have more than one combination of latitude and longitude. This is because of the geographical extent of the stations over the zone. Hence the bike traffic is to be calculated per station_id rather than per combination of latitude and longitude.
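Whether a station is recorded with several coordinate pairs can be checked by counting the distinct pairs per station. This is a minimal sketch on a hypothetical `trips` dataframe (station ids and coordinates are invented); it only assumes the `start_station_id`, `start_lat` and `start_lon` column names from the dataset.

```python
import pandas as pd

# Hypothetical trips (station ids and coordinates invented for illustration)
trips = pd.DataFrame({
    "start_station_id": [3005, 3005, 3005, 3006, 3006],
    "start_lat": [34.0485, 34.0486, 34.0485, 34.0456, 34.0456],
    "start_lon": [-118.2585, -118.2586, -118.2585, -118.2567, -118.2567],
})

# Keep one row per distinct (station, lat, lon) triple, then count the rows per station
coords_per_station = (trips.drop_duplicates(subset=["start_station_id", "start_lat", "start_lon"])
                           .groupby("start_station_id")
                           .size())
print(coords_per_station)
```

Stations with a count above 1 would be split across several cells or points if traffic were aggregated by coordinates, which is why aggregating by station id is the safer choice.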
The stations with the least activity are to be accurately identified based on both their bike rental traffic and their bike return traffic, because some stations might have few bike rentals but compensate with high bike return traffic, and vice versa. Hence only the stations with low combined activity (bike rentals and returns) are to be deemed high-maintenance and eligible for termination.
Create the dataframe with the bike rentals based on start_station_id.
# find the rentals based on start_station_id
start_stations = bikeshare.groupby([bikeshare['start_station_id']]).size().reset_index(name='rentals')
start_stations.rename(columns={'start_station_id':'station_id'}, inplace=True)
start_stations.head()
Create the dataframe with the bike returns based on end_station_id.
# find the returns based on end_station_id
end_stations = bikeshare.groupby([bikeshare['end_station_id']]).size().reset_index(name='returns')
end_stations.rename(columns={'end_station_id':'station_id'}, inplace=True)
end_stations.head()
Combine the two dataframes into a single dataframe.
stations = pd.merge(start_stations, end_stations, on='station_id', how='outer')
stations = stations.fillna(0)
stations.head()
Plot the distribution of bike rentals and bike returns for investigation.
# Assign color palette and figure size as per requirement
plt.figure(figsize = [6, 4])
sb.set_style('white')
sb.set_palette(palette = 'colorblind', n_colors = 10, desat = None)
base_color = sb.color_palette()[0]
# Seaborn's regplot
sb.regplot(data = stations, x = 'rentals', y = 'returns', fit_reg = True,
scatter_kws = {'alpha' : 1/10}, line_kws = {'alpha' : 0.2}, color = base_color);
# improve plot aesthetics
plt.title("Distribution of bike station's traffic\n", fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rentals (thousands)', fontsize = 14)
plt.ylabel('Bike Returns (thousands)\n', fontsize = 14)
# get xtick locs and rearrange them with respect to zero
x_locs, x_labels = plt.xticks()
x_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in x_locs]
plt.xticks(x_locs, x_tick_names, fontsize = 12)
# get ytick locs and rearrange them with respect to zero
y_locs, y_labels = plt.yticks()
y_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in y_locs]
plt.yticks(y_locs, y_tick_names, fontsize = 12);
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.3 Distribution of bike stations traffic.png', dpi=300, bbox_inches='tight')
The above plot shows that bike rentals and bike returns follow a linear pattern. A majority of the bike stations are clustered between 0 and 10K bike returns and between 0 and 5K bike rentals.
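The linear pattern can also be quantified rather than only read off the plot. The sketch below uses made-up counts (not values from the dataset) to show how a Pearson correlation and a least-squares line could be computed for the `rentals` and `returns` columns.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical counts): returns track rentals closely
demo = pd.DataFrame({'rentals': [10, 200, 1500, 4000, 9000],
                     'returns': [12, 190, 1550, 3900, 9100]})

# A Pearson correlation close to 1 indicates a strong linear relationship
r = demo['rentals'].corr(demo['returns'])

# Slope and intercept of the least-squares line through the points
slope, intercept = np.polyfit(demo['rentals'], demo['returns'], deg=1)
print(round(r, 3), round(slope, 3))
```

On the real `stations` dataframe the same two calls would apply unchanged; a slope near 1 would mean a station receives roughly as many bikes as it sends out.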
Visual display of stations with relatively low activity:
Plot the stations with low activity:
# Assign figure and color palette as per requirement
plt.figure(figsize=[18, 5])
sb.set_style('white')
sb.set_palette('deep', n_colors = 2, desat = 0.8)
# prepare the data for the subplots
low_traffic = stations[(stations['rentals'] < 10) & (stations['returns'] < 10)]
not_low_traffic = stations[~((stations['rentals'] < 10) & (stations['returns'] < 10))]
# left plot: dataset that has all entries
# -------------------------------------------------------
plt.subplot(1, 3, 1)
sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
fit_reg = False, scatter_kws = {'alpha' : 1/10});
sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
fit_reg = False, scatter_kws = {'alpha' : 1/2});
# improve plot aesthetics
plt.title('Overall traffic\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals (thousands)', fontsize = 14)
plt.ylabel('Bike Returns (thousands)\n', fontsize = 14)
# get xtick locs and rearrange them with respect to zero
x_locs, x_labels = plt.xticks()
x_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in x_locs]
plt.xticks(x_locs, x_tick_names, fontsize = 12)
# get ytick locs and rearrange them with respect to zero
y_locs, y_labels = plt.yticks()
y_tick_names = ['{:0.0f} K'.format(loc/1000) if (loc >= 0 and loc % 10000 == 0) else '' for loc in y_locs]
plt.yticks(y_locs, y_tick_names, fontsize = 12);
# -------------------------------------------------------
# middle plot: zoom in on stations with traffic under 100
# -------------------------------------------------------
plt.subplot(1, 3, 2)
ax = sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
fit_reg = False, scatter_kws = {'alpha' : 1/2});
sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
fit_reg = False, scatter_kws = {'alpha' : 1/2}, ax = ax);
ax.set(xlim=(-10, 100))
ax.set(ylim=(-10, 100));
# improve plot aesthetics
plt.title('Traffic under 100\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12);
# -------------------------------------------------------
# right plot: zoom in on stations with traffic under 10
# -------------------------------------------------------
plt.subplot(1, 3, 3)
ax = sb.regplot(data = not_low_traffic, x = 'rentals', y = 'returns', color = 'c',
fit_reg = False, scatter_kws = {'alpha' : 1/2});
sb.regplot(data = low_traffic, x = 'rentals', y = 'returns', color = 'orange',
fit_reg = False, scatter_kws = {'alpha' : 1/2}, ax = ax);
ax.set(xlim=(-1, 10))
ax.set(ylim=(-1, 10));
# improve plot aesthetics
plt.title('Traffic under 10\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike Rentals', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize = 12);
# -------------------------------------------------------
plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle("Distribution of bike station's traffic\n", fontsize = 18, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.10.4 Distribution of bike stations traffic.png', dpi=300, bbox_inches='tight')
In the above plot, the orange markers represent the bike stations with very low activity (fewer than 10 bike rentals and fewer than 10 bike returns).
Observation:
The above plot shows that some stations have relatively low bike activity (rentals and returns) and can be deemed high maintenance. These stations recorded fewer than 10 rentals and fewer than 10 returns. Hence these stations are not financially suitable for further maintenance and need to be terminated or relocated.
Identification of the stations with low bike activity for further action:
# extract the stations with low bike activity
low_activity = stations[(stations['rentals'] < 10) & (stations['returns'] < 10)]
low_activity
Display the list of the stations with low bike activity:
# display the id of the stations with low bike activity
print('Low activity bike stations:')
print('-'*27)
for i, station in enumerate(low_activity.station_id.values):
    print('{}. Station ID: {}'.format(i+1, station))
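Note that the filter above requires each count to be below 10 separately, which is not the same as requiring rentals and returns combined to be below 10. The sketch below, on hypothetical counts, shows how the two conditions can select different stations.

```python
import pandas as pd

# Toy data (hypothetical): station 2 has 7 rentals and 8 returns
demo = pd.DataFrame({'station_id': [1, 2, 3],
                     'rentals': [2, 7, 500],
                     'returns': [3, 8, 480]})

# Filter used above: each count is individually below 10
each_below = demo[(demo['rentals'] < 10) & (demo['returns'] < 10)]

# Stricter alternative: rentals and returns combined are below 10
combined_below = demo[(demo['rentals'] + demo['returns']) < 10]

print(each_below['station_id'].tolist())
print(combined_below['station_id'].tolist())
```

Station 2 passes the first filter but not the second, so which filter to use depends on how "low activity" is meant to be defined.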
These stations recorded fewer than 10 bike activities over the span of 3 years.
Column: start_station_id, end_station_id, start_time, end_time
Data type: (categorical, ordinal), (categorical, ordinal), (numerical, continuous), (numerical, continuous)
Plot: line plot, scatter plot
Let us observe the hourly bike rentals and bike returns, aggregated for each hour of the day, to identify the gap between bike demand and supply.
Distribution of average bike rentals and bike returns over the hour of the day:
# Assign grid and figure size
plt.figure(figsize = [8, 6])
sb.set_style('darkgrid')
# prepare the data for the plot
x1 = bikeshare.groupby(bikeshare['start_time'].dt.hour).count()['trip_id'].index
y1 = bikeshare.groupby(bikeshare['start_time'].dt.hour).count()['trip_id'].values
x2 = bikeshare.groupby(bikeshare['end_time'].dt.hour).count()['trip_id'].index
y2 = bikeshare.groupby(bikeshare['end_time'].dt.hour).count()['trip_id'].values
x_tick_values = np.arange(0, 23+1, 1)
x_tick_names = ['{:}'.format(v) for v in x_tick_values]
y_tick_values = np.arange(0, bikeshare.start_time.dt.hour.value_counts().max()+10000, 10000)
y_tick_names = ['{:0.0f} K'.format(v/1000) for v in y_tick_values]
# plot matplotlib's line plot
plt.plot(x1, y1, linewidth=2.0, color = 'lightskyblue', alpha = 0.5)
plt.plot(x2, y2, linewidth=2.0, color = 'orange', alpha = 0.5)
# improve plot aesthetics
plt.title('Distribution of hourly rentals and returns\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nHour of the day', fontsize = 14)
plt.ylabel('Count (thousands)\n', fontsize = 14)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
plt.fill_between(x1, y1, color = 'lightskyblue', alpha = 0.5)
plt.fill_between(x2, y2, color = 'orange', alpha = 0.5)
# draw the vertical axial line at the peak hour
start_peak_hour = bikeshare['start_time'].dt.hour.value_counts(ascending=False).index[0]
plt.axvline(start_peak_hour, color='black', alpha=0.3, linewidth=2)
end_peak_hour = bikeshare['end_time'].dt.hour.value_counts(ascending=False).index[0]
plt.axvline(end_peak_hour, color='pink', alpha=0.3, linewidth=2);
# add custom legend
custom_lines = [Line2D([0], [0], color= 'lightskyblue', lw=2),
Line2D([0], [0], color= 'orange', lw=2)]
plt.legend(custom_lines, ['Rentals', 'Returns'], scatterpoints=1, frameon=True, fancybox=True, shadow=False,
ncol = 1, framealpha = 1, borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5,
title='Metro Bike', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.3, 1));
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.11.1 Distribution of hourly bike rentals and returns.png', dpi=300, bbox_inches='tight')
The above plot depicts that:
- Early Hours (0:00 - 5:00).
- Morning and Post Morning (5:00 - 13:00).
- Evenings and Nights (14:00 - 23:00).
However, both bike rentals and bike returns are plotted as an aggregate over 3 years (2017 - 2019). Hence, plot the average bike rentals and bike returns for the individual years to look for any hidden insights.
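Before moving to the per-year view, the demand and supply gap can be expressed as a number per hour instead of being read from two overlapping areas. The sketch below is a minimal illustration on hypothetical timestamps; `gap` is a name introduced here and is not part of the notebook.

```python
import pandas as pd

# Toy trips (hypothetical timestamps, for illustration only)
trips = pd.DataFrame({
    'start_time': pd.to_datetime(['2019-01-01 08:10', '2019-01-01 08:40',
                                  '2019-01-01 09:05', '2019-01-01 17:30']),
    'end_time': pd.to_datetime(['2019-01-01 08:35', '2019-01-01 09:10',
                                '2019-01-01 09:30', '2019-01-01 17:55']),
})

# Count rentals and returns per hour of day, on a common 0-23 index
hours = range(24)
rentals = trips['start_time'].dt.hour.value_counts().reindex(hours, fill_value=0)
returns = trips['end_time'].dt.hour.value_counts().reindex(hours, fill_value=0)

# Positive gap: more bikes leave than come back in that hour
gap = rentals - returns
print(gap[gap != 0])
```

Reindexing both series to the same 24 hours avoids misaligned subtraction when an hour has rentals but no returns, or the reverse.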
Distribution of average bike rentals and bike returns over the years:
# create a dataset for the bike rentals for each hour in a respective day
# size() returns a Series indexed by (year, month, day, hour)
start_df = bikeshare.groupby([bikeshare['start_time'].dt.year,
                              bikeshare['start_time'].dt.month,
                              bikeshare['start_time'].dt.day,
                              bikeshare['start_time'].dt.hour]).size()
start_df = start_df.rename_axis(['year','month', 'day', 'hour']).reset_index(name='rentals')
start_df['rentals'] = start_df['rentals'].fillna(0).astype(int)
start_df.head()
# create a dataset for the bike returns for each hour in a respective day
# size() returns a Series indexed by (year, month, day, hour)
end_df = bikeshare.groupby([bikeshare['end_time'].dt.year,
                            bikeshare['end_time'].dt.month,
                            bikeshare['end_time'].dt.day,
                            bikeshare['end_time'].dt.hour]).size()
end_df = end_df.rename_axis(['year','month', 'day', 'hour']).reset_index(name='returns')
end_df['returns'] = end_df['returns'].fillna(0).astype(int)
end_df.head()
Point plot:
def point_subplot(subplot, year):
    # plot the average hourly bike rentals and bike returns for a single year
    #-----------------------Start of subplot-----------------------
    # prepare the data for the plot
    sb.set_style('dark')
    plt.subplot(1, 4, subplot)
    start_year_df = start_df[ start_df['year'] == year ]
    end_year_df = end_df[ end_df['year'] == year ]
    # plot point plots for bike rentals and bike returns over the year
    ax = sb.pointplot(data = start_year_df, x = "hour", y = "rentals", linestyles = "-",
                      color = 'lightskyblue', ci=None, markers = '')
    ax = sb.pointplot(data = end_year_df, x = "hour", y = "returns", linestyles = "-",
                      color = 'orange', ci=None, ax =ax, markers = '')
    # obtain the two lines from the axes to generate shading
    l1 = ax.lines[0]
    l2 = ax.lines[1]
    # Get the xy data from the lines so that we can shade
    x1 = l1.get_xydata()[:,0]
    y1 = l1.get_xydata()[:,1]
    x2 = l2.get_xydata()[:,0]
    y2 = l2.get_xydata()[:,1]
    # fill the area under the individual lines
    ax.fill_between(x1,y1, color='lightskyblue', alpha=0.5)
    ax.fill_between(x2,y2, color='orange', alpha=0.5);
    # improve plot aesthetics
    plt.title('Year = {}\n'.format(year), fontsize = 14, weight = 'bold', color = 'dimgrey')
    plt.xlabel('\nHour of the day', fontsize = 14)
    plt.ylabel('Count (thousands)\n', fontsize = 14)
    plt.xticks(fontsize = 10)
    locs, labels = plt.yticks()
    # only the first subplot keeps its y axis label and tick labels
    if subplot == 1:
        plt.ylabel('Count (thousands)\n', fontsize = 14)
        plt.yticks(fontsize = 10)
    else:
        plt.ylabel('')
        plt.yticks(locs, [])
    return ax
#-------------------------End of subplot------------------------
# Assign grid and figure size
plt.figure(figsize = [24, 6])
sb.set_style('dark')
# plot subplots over years
ax1 = point_subplot(subplot = 1, year = 2017)
ax2 = point_subplot(subplot = 2, year = 2018)
ax3 = point_subplot(subplot = 3, year = 2019)
# adjust the plots to have the same y axis limits
# use the limits of the subplot with the largest upper y limit for all subplots
tallest = max([ax1, ax2, ax3], key = lambda ax: ax.get_ylim()[1])
for ax in [ax1, ax2, ax3]:
    ax.set_ylim(tallest.get_ylim())
plt.subplots_adjust(wspace=0.05, hspace=0.3);
plt.subplots_adjust(top=0.8)
plt.suptitle('Distribution of average bike rentals and bike returns over the years\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.11.2 Average bike rentals and bike returns over years.png', dpi=300, bbox_inches='tight')
The above plot reinforces the previous observations on the distribution of bike demand and supply over the hour of the day. In the time span of 3 years, the only period where the bike supply falls short of demand is during the mornings (8:00 AM - 1:00 PM). However, the gap is very small and does not require any immediate attention.
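One caveat when reading these averages: `groupby(...).size()` only produces a row for an hour in which at least one trip occurred, so the mean drawn by `pointplot` is taken over the rows that exist. The sketch below, on hypothetical counts, contrasts that mean with one that treats missing hours as zero. Whether this matters for the real data is an assumption to check, not a finding.

```python
import pandas as pd

# Toy hourly counts (hypothetical): hour 8 has rentals on two days,
# hour 3 has rentals on only one of the two days
start_demo = pd.DataFrame({'day': [1, 2, 1],
                           'hour': [8, 8, 3],
                           'rentals': [40, 60, 2]})

# Mean over the rows that exist (pointplot's default estimator is the mean)
mean_observed = start_demo.groupby('hour')['rentals'].mean()

# Mean over all days, counting days without a row as zero rentals
n_days = start_demo['day'].nunique()
mean_all_days = start_demo.groupby('hour')['rentals'].sum() / n_days

print(mean_observed.to_dict())
print(mean_all_days.to_dict())
```

In this toy case the two means agree for hour 8 but differ for hour 3, so quiet hours are the ones most likely to be overstated.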
The 6 hours between 8:00 AM and 2:00 PM experience a shortage of bike supply compared to customer demand. However, the gap between supply and demand is very small and does not require any immediate attention.
Column: start_station_id
Data type: categorical, ordinal
Plot: distribution plot, pie chart, bar chart
Logarithmic distribution of start_stations bike rentals:
Calculate the bike rentals for each start station.
# find the rentals based on start_station_id
start_stations = bikeshare.groupby([bikeshare['start_station_id']]).size().reset_index(name='rentals')
start_stations.head()
Explore the Logarithmic distribution of start stations bike rentals:
def log_trans(x, inverse = False):
    # apply the log10 transform, or its inverse when inverse = True
    if not inverse:
        return np.log10(x)
    else:
        return 10 ** x
sb.set_style('white')
# prepare the data for the plot
min_value = log_trans(start_stations['rentals'].min())
max_value = log_trans(start_stations['rentals'].max())
bin_edges = np.arange(0, max_value+0.1, 0.1)
# matplotlib's histogram
plt.hist(start_stations['rentals'].apply(log_trans), bins = bin_edges, color = 'darkturquoise')
# improve plot aesthetics
plt.title('Logarithmic distribution of start stations bike rentals\n', fontsize = 14, weight = 'bold')
plt.xlabel('\nNumber of bike rentals', fontsize = 12)
plt.ylabel('Number of Start stations\n', fontsize = 12);
tick_locs = np.arange(0, max_value+1, 1)
plt.xticks(tick_locs, log_trans(tick_locs, inverse = True).astype(int), fontsize = 10)
plt.yticks(fontsize = 10)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.1 Logarithmic distribution of start stations bike rentals.png', dpi=300, bbox_inches='tight')
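The binning logic in the cell above can be checked without drawing anything. The following sketch uses hypothetical rental counts and `np.histogram` to show how the log10 bin edges and the power-of-ten tick labels are derived; it mirrors the cell's arithmetic but is not the notebook's own code.

```python
import numpy as np

# Toy rental counts per station (hypothetical values)
rentals = np.array([3, 8, 40, 250, 900, 1200, 15000])

# Bin in log10 space with a step of 0.1, as in the cell above
log_rentals = np.log10(rentals)
bin_edges = np.arange(0, log_rentals.max() + 0.1, 0.1)
counts, _ = np.histogram(log_rentals, bins=bin_edges)

# Tick positions at whole powers of ten, labelled in the original units
tick_locs = np.arange(0, log_rentals.max() + 1, 1)
tick_labels = np.round(10 ** tick_locs).astype(int)
print(tick_labels)
```

Because the bins are equally wide in log space, each bin covers a constant ratio of rental counts rather than a constant difference, which is what makes the long right tail readable.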
Break down the bike rental traffic at the start stations based on the above plot.
Classification of start_stations based on their rental traffic:
Create a dataframe based on bike rental traffic and the number of start stations associated with each traffic level.
rentals = {'rental_traffic' : pd.Series(['Very Low', 'Low', 'Normal', 'High', 'Very High']),
'start_stations' : pd.Series([start_stations.query(' rentals < 10 ').shape[0],
start_stations.query(' rentals >= 10 and rentals < 100 ').shape[0],
start_stations.query(' rentals >= 100 and rentals < 1000 ').shape[0],
start_stations.query(' rentals >= 1000 and rentals < 10000 ').shape[0],
start_stations.query(' rentals >= 10000 ').shape[0]])}
# create the Dataframe.
bike_rentals = pd.DataFrame(rentals)
bike_rentals
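The five `query` calls above can also be expressed as a single binning step. The sketch below is one possible alternative on hypothetical counts, using `pd.cut` with `right=False` so that the half-open bins match the thresholds in the queries (`rentals < 10`, `rentals >= 10 and rentals < 100`, and so on).

```python
import numpy as np
import pandas as pd

# Toy rental counts per station (hypothetical values)
demo = pd.DataFrame({'rentals': [4, 10, 99, 100, 2500, 10000, 43000]})

# Half-open bins [0, 10), [10, 100), ... match the query thresholds above
edges = [0, 10, 100, 1000, 10000, np.inf]
labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
demo['rental_traffic'] = pd.cut(demo['rentals'], bins=edges, labels=labels, right=False)

# Count stations per traffic class
summary = demo['rental_traffic'].value_counts(sort=False)
print(summary)
```

The boundary values 10, 100 and 10000 in the toy data land in the higher class, exactly as the `>=` comparisons in the queries would place them.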
Plot the distribution of start stations bike rentals traffic.
def absolute_value(val):
    '''returns absolute count of start stations to plot in
    the pie chart as annotations using the autopct parameter'''
    a = np.round(val/100.*type_level_counts.sum(), 0)
    return int(a)
# Assign grid and figure size
plt.figure(figsize = [12, 5])
sb.set_style('white')
# left plot: Pie chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)
# prepare the data for the plot
type_level_counts = bike_rentals.start_stations.values
type_level_index = bike_rentals.rental_traffic.values
explode = (0.2, 0, 0, 0, 0)
colors = ['paleturquoise', 'darkturquoise', 'darkturquoise', 'darkturquoise', 'darkturquoise']
# matplotlib's pie chart
plt.pie(type_level_counts, labels = type_level_index, startangle = 90,
counterclock = False, wedgeprops = {'width' : 0.4}, shadow=False,
explode=explode, colors=colors, textprops={'fontsize': 12},
autopct='%1.0f%%', labeldistance=1.1, pctdistance=0.8)
plt.title('Percent of Stations\n\n', fontsize = 14, weight = 'bold', color = 'grey')
plt.axis('square');
# =====================================================
# /////////////////////////////////////////////////////
# right plot: Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)
# prepare the data for the plot
counts = bike_rentals.start_stations.values
order = bike_rentals.start_stations.index
y_locs = [0, 1, 2, 3, 4]
y_labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
clrs = [ 'darkturquoise' if (x > bike_rentals.start_stations.values.min()) else 'paleturquoise' for x in counts ]
# seaborn's bar plot
sb.barplot(x = counts, y = order, palette=clrs, alpha= 1, saturation = 0.8, orient = 'h')
# improve plot aesthetics
plt.title('Number of Stations\n\n', weight = 'bold', fontsize = 14, color = 'grey')
plt.yticks(y_locs, y_labels, rotation = 0, fontsize = 12)
plt.xticks([], [], rotation = 0, fontsize = 12)
plt.xlabel('', fontsize = 14)
plt.ylabel('', fontsize = 14)
# add annotations
# -------------------------------------------------------
# loop through each pair of locations and labels
for loc, count in zip(y_locs, counts):
    pct_string = '{:0.0f}'.format(count)
    # print the annotation based on bar length
    if count < int(max(counts)/10):
        plt.text(count+int(max(counts)/25), loc+0.1, pct_string, ha = 'center', color = 'black', weight = 'bold', fontsize = 13)
    else:
        plt.text(count-int(max(counts)/15), loc+0.1, pct_string, ha = 'center', color = 'white', fontsize = 13)
# -------------------------------------------------------
sb.despine(fig=None, ax=None, top=True, right=True, left=True, bottom=True, offset=None, trim=False);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Classification of start stations based on Rental Traffic\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.2 Classification of start stations based on Rental Traffic.png', dpi=300, bbox_inches='tight')
The above plot shows that some start stations have very low bike rental activity. This means that the bike rental traffic is not equally distributed over the start stations.
However, this does not imply that these start stations should be eliminated, as they might receive good bike return traffic and still achieve acceptable business metrics.
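The plotting cell above defines the `absolute_value` helper, but the pie chart is drawn with the format string `'%1.0f%%'`, so the wedges show percentages. If absolute station counts were wanted on the wedges instead, a callable could be passed to `autopct`. The sketch below shows only the percentage-to-count arithmetic; `make_autopct` is a hypothetical name and the counts are made up.

```python
def make_autopct(counts):
    '''Return a formatter that turns a wedge percentage back into a count.'''
    total = sum(counts)

    def autopct(pct):
        # pct is the wedge share in percent, as passed by plt.pie
        return '{:d}'.format(int(round(pct / 100.0 * total)))

    return autopct

# Toy station counts per traffic class (hypothetical values)
demo_counts = [5, 20, 60, 100, 15]
formatter = make_autopct(demo_counts)
print(formatter(2.5))   # share of the first class out of 200 stations
```

Using a closure keeps the total next to the formatter instead of relying on a global such as `type_level_counts`, which the original helper reads at call time.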
Deeper exploration of start stations rental behaviour for hidden insights:
Distribution of start stations rental traffic categorized by trip type:
Obtain the rentals for each start station, categorized by trip type:
# create a dataframe with start stations rentals over trip type
start_stations = bikeshare.groupby([bikeshare['start_station_id'],
bikeshare['trip_type']]).size().reset_index(name='rentals')
start_stations.head()
Categorize the rental traffic values into categorical sections:
# use the pd.cut function to assign each rental count to its bin
bin = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'],bin)
category = category.to_frame()
category.columns = ['rental_bins']
category['trip_type'] = start_stations['trip_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'trip_type', 'rental_bins'])
category.head()
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Label the rental bins:
%%time
def label_race(df):
    # label each rental bin, in ascending order, with its traffic level
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
label_race(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
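The binning and labelling steps above could also be done in one call, since `pd.cut` accepts a `labels` argument and returns an ordered categorical. The sketch below uses hypothetical rental counts. It also shows that the default bins are right-inclusive, so a count of exactly 10 falls in `(0, 10]`, which differs from the earlier `rentals < 10` query.

```python
import pandas as pd

# Toy rentals per (station, trip type) pair (hypothetical values)
demo = pd.DataFrame({'rentals': [7, 10, 11, 350, 4200, 25000]})

edges = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# labels= names each bin directly and yields an ordered categorical
demo['traffic'] = pd.cut(demo['rentals'], bins=edges, labels=level_order)

# Default bins are right-inclusive, so exactly 10 rentals is still 'Very Low'
print(demo)
```

This is a sketch of an alternative, not a claim that the results above would change; counts sitting exactly on a bin edge are the only ones affected by the choice of `right`.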
Prepare a dataframe to categorize start stations over rental traffic and trip_type:
# prepare a dataframe to categorize start stations over rental traffic and trip_type
temp_df = category.groupby([category['traffic'], category['trip_type']]).size().reset_index(name='start_stations')
temp_df
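The same station counts can be viewed as a two-way table, which makes the comparison between trip types easier to read than the long format of `temp_df`. The sketch below uses hypothetical rows and `pd.crosstab`; the `normalize='columns'` variant expresses each cell as a share of that trip type's stations.

```python
import pandas as pd

# Toy classification (hypothetical rows): one row per station and trip type
demo = pd.DataFrame({
    'trip_type': ['One Way', 'One Way', 'Round Trip', 'Round Trip', 'Round Trip'],
    'traffic': ['Low', 'High', 'Low', 'Low', 'Very Low'],
})

# Rows are traffic classes, columns are trip types, cells are station counts
table = pd.crosstab(demo['traffic'], demo['trip_type'])

# Share of each trip type's stations that fall in each traffic class
share = pd.crosstab(demo['traffic'], demo['trip_type'], normalize='columns')
print(table)
print(share.round(2))
```

Shares are useful when the trip types have different numbers of station rows, because raw counts alone can exaggerate the difference between them.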
Data Dashboard:
Plot the distribution of start station traffic based on trip type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'trip_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 12), textcoords = 'offset points', fontsize = 12)
# plot vertical axial lines for categorical separation
separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ['darkturquoise', 'paleturquoise']
sb.set_palette(flatui, desat = 0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'trip_type', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 12), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'trip_type', y = 'start_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('trip_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))
sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
flatui = ['darkturquoise', 'paleturquoise']
sb.set_palette(flatui, desat = 0.6)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'trip_type');
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'trip_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.3 Start stations rental traffic categorized by trip type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that 8 start stations have Very Low rental traffic and 57 start stations have Low rental traffic for One Way trips, while 22 start stations have Very Low rental traffic and 96 start stations have Low rental traffic for Round Trips. This means that more stations experience Low and Very Low rental traffic for Round Trips.
- Also, the number of start stations that experience High and Very High bike rental traffic for Round Trips is lower than for One Way trips.
- This reveals the need to improve rental traffic at start stations for Round Trips.
Reform:
- Discounts or promotions should be announced for Round Trips at start stations that experience Low and Very Low rental traffic, to encourage customers to rent bikes from these particular stations.
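To act on this reform, the candidate stations need to be listed. The sketch below shows one way this could be done on a dataframe shaped like `category`; the station ids and labels are hypothetical, and `promo_stations` is a name introduced here for illustration.

```python
import pandas as pd

# Toy classification (hypothetical station ids and labels)
demo = pd.DataFrame({
    'start_station_id': [3005, 3005, 3010, 3010, 3020],
    'trip_type': ['One Way', 'Round Trip', 'One Way', 'Round Trip', 'Round Trip'],
    'traffic': ['High', 'Low', 'Normal', 'Very Low', 'Normal'],
})

# Round Trip rows whose rental traffic is Low or Very Low
is_round_trip = demo['trip_type'] == 'Round Trip'
is_low = demo['traffic'].isin(['Very Low', 'Low'])
promo_stations = demo.loc[is_round_trip & is_low, 'start_station_id'].tolist()
print(promo_stations)
```

Station 3005 appears in the result even though its One Way traffic is High, which illustrates that a promotion would target a station and trip type pair rather than the station as a whole.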
Distribution of start stations rental traffic categorized by bike type:
Obtain the rentals for each start station, categorized by bike type:
# create a dataframe with start stations rentals over bike type
start_stations = bikeshare.groupby([bikeshare['start_station_id'],
bikeshare['bike_type']]).size().reset_index(name='rentals')
start_stations.head()
Categorize the rental traffic values into categorical sections:
# use the pd.cut function to assign each rental count to its bin
bin = [0,10,100,1000,10000,100000]
category = pd.cut(start_stations['rentals'],bin)
category = category.to_frame()
category.columns = ['rental_bins']
category['bike_type'] = start_stations['bike_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'bike_type', 'rental_bins'])
category.head()
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Label the rental bins:
%%time
def label_race(df):
    # label each rental bin, in ascending order, with its traffic level
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
label_race(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
Prepare a dataframe to categorize start stations over rental traffic and bike_type:
# prepare a dataframe to categorize start stations over rental traffic and bike_type
temp_df = category.groupby([category['traffic'], category['bike_type']]).size().reset_index(name='start_stations')
temp_df
Data Dashboard:
Plot the distribution of start station traffic based on bike type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = df, x = 'bike_type', hue = 'traffic', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.75, 1.1), loc = 'upper left', labelspacing=0.5,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'bike_type', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
# sb.set_palette(flatui, n_colors=5, desat=0.8)
sb.set_palette('GnBu', n_colors=5, desat=0.6)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'bike_type', y = 'start_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.8, 1.15))
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'bike_type');
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.4 Start stations rental traffic categorized by bike type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that, across bike types, very few start stations experience Very Low bike rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small, which suggests that most start stations are not used to their full potential.
- The numbers of start stations that experience Low and Very Low bike rental activity are clustered closely across bike types. This suggests that bike type has little influence on rental activity at these stations and that other factors might have caused the lower activity there. So no action specific to bike type is required to increase rental activity at start stations with Low and Very Low rental activity.
- Smart bikes have fewer start stations with Normal and High rental traffic than the other bike types.
Reform:
- Discounts or promotions should be announced for Round Trips at start stations that experience Low and Very Low rental traffic, to encourage customers to rent bikes from these particular stations.
- Announce promotions on Smart bikes to increase their rental activity at start stations.
Distribution of start stations rental traffic categorized by pass type:
Obtain the rentals for each start station, categorized by pass type:
# create a dataframe with start stations rentals over pass type
start_stations = bikeshare.groupby([bikeshare['start_station_id'],
bikeshare['pass_type']]).size().reset_index(name='rentals')
start_stations.head()
Categorize the rental traffic values into categorical sections:
# use pd.cut to assign each rental count to its bin
bin_edges = [0,10,100,1000,10000,100000]  # renamed from `bin` to avoid shadowing the Python builtin
category = pd.cut(start_stations['rentals'], bin_edges)
category = category.to_frame()
category.columns = ['rental_bins']
category['pass_type'] = start_stations['pass_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'pass_type', 'rental_bins'])
category.head()
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Label the rental bins:
%%time
def label_race(df):
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
label_race(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
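The binning and labelling steps above can also be sketched in a single call, since `pd.cut` accepts a `labels` argument and returns an ordered categorical. This is only an illustrative alternative, not the notebook's method; the `rentals` values below are hypothetical. One possible advantage is that the label no longer depends on the position of a bin in `unique()`, which could shift if one of the bins happened to be empty.

```python
import pandas as pd

# Hypothetical rental counts; in the notebook these come from start_stations['rentals']
rentals = pd.Series([3, 10, 57, 480, 2500, 31000], name='rentals')

bin_edges = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# pd.cut uses right-closed intervals by default: (0, 10], (10, 100], ...
# Passing labels returns an ordered categorical, so no separate astype() step is needed
traffic = pd.cut(rentals, bins=bin_edges, labels=level_order)

print(traffic.tolist())
print(traffic.cat.ordered)
```

Note that a count of exactly 10 lands in `Very Low` here, because the default intervals include their right edge.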
Prepare a dataframe to categorize start stations over rental traffic and pass_type.
# prepare a dataframe to categorize start stations over rental traffic and pass_type
temp_df = category.groupby([category['traffic'],
category['pass_type']]).count()['start_station_id'].reset_index(name='start_stations')
temp_df.head(10)
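As a cross-check on the grouped counts, the same station tally can be sketched as a two-way table with `pd.crosstab`. The `category_demo` frame below is a hypothetical stand-in for `category`, with made-up station ids and labels.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `category` dataframe
category_demo = pd.DataFrame({
    'start_station_id': [3005, 3005, 3006, 3006, 3007],
    'pass_type': ['Monthly', 'One Day', 'Monthly', 'One Day', 'Monthly'],
    'traffic': ['High', 'Low', 'Normal', 'Low', 'High'],
})

# Rows are traffic levels, columns are pass types, cells count the start stations
station_counts = pd.crosstab(category_demo['traffic'], category_demo['pass_type'])
print(station_counts)
```

The wide layout makes it easy to compare pass types within one traffic level, while the long `temp_df` layout is the one the point plots need.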
Data Dashboard:
Plot the distribution of Start station traffic based on pass type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plot based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            # positional indexing past the end of an array or list raises IndexError, not KeyError
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
sb.set_palette('GnBu', n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = df, x = 'pass_type', hue = 'traffic', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of Start stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (-0.08, 1.15), loc = 'upper left', labelspacing=0.5,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'pass_type', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette('GnBu', n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'pass_type', y = 'start_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0, 1.15))
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
flatui = ["#26bda7", "#9b59b6", "#3498db", "#e74c3c", "#34495e"]
sb.set_palette(flatui)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'pass_type');
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.5 Start stations rental traffic categorized by pass type.png', dpi=300, bbox_inches='tight')
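The bar-annotation loop is repeated for every clustered bar chart in this dashboard, so it could be factored into a small helper. The sketch below is one way to do that with plain matplotlib; `annotate_bars` is a hypothetical name, and the bar heights are made-up values. It also skips bars with zero or NaN height, which may appear in `ax.patches` for empty category combinations depending on the seaborn version.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

def annotate_bars(ax, fontsize=12, offset=10):
    '''write each bar's height above it; return the number of labels added'''
    n_added = 0
    for p in ax.patches:
        height = p.get_height()
        if not height > 0:  # skips zero-height bars and NaN heights
            continue
        ax.annotate(format(height, '.0f'), (p.get_x() + p.get_width() / 2., height),
                    ha='center', va='center', xytext=(0, offset),
                    textcoords='offset points', fontsize=fontsize)
        n_added += 1
    return n_added

# Hypothetical station counts per pass type
fig, ax = plt.subplots()
ax.bar(['Monthly', 'One Day', 'Annual'], [12, 0, 7])
n_labels = annotate_bars(ax)
print(n_labels)
```

In the dashboard cells, the call would be `annotate_bars(g)` right after each `sb.countplot(...)`.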
Observations:
- The above plot shows that most start stations experience either Low or Very Low rental traffic from Annual pass holders. As the Annual pass is a long-term subscription, this behaviour is to be expected.
- All start stations fall into the Low and Very Low rental traffic categories for the Flex pass. This is because the Flex pass was originally issued to employees for testing purposes. Hence this insight is ignored.
- Bike rentals on the Walk-up pass are ignored, as it was discontinued after the year 2018.
- A fair number of start stations experience Low rental traffic from Monthly pass holders. As the Monthly pass is mostly preferred by working individuals, promotions are unlikely to persuade them to switch their start station, since doing so might delay their ride to work. Hence no action is required to increase Monthly pass rental activity at start stations with Low rental activity.
- Many start stations experience relatively Low bike rental activity from One Day pass holders.
Reform:
- Promotions should be announced to increase One Day pass rental traffic at start stations with Low bike rental activity.
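To act on this reform, the affected stations first have to be listed. A minimal sketch, assuming a frame shaped like `category` with `start_station_id`, `pass_type` and `traffic` columns; the station ids and labels in `category_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `category` dataframe
category_demo = pd.DataFrame({
    'start_station_id': [3005, 3005, 3006, 3006, 3007, 3007],
    'pass_type': ['Monthly', 'One Day', 'Monthly', 'One Day', 'Monthly', 'One Day'],
    'traffic': ['High', 'Low', 'Normal', 'Low', 'High', 'Normal'],
})

# Keep only the One Day rows that fall in the Low traffic category
is_target = (category_demo['pass_type'] == 'One Day') & (category_demo['traffic'] == 'Low')
promo_stations = category_demo.loc[is_target, 'start_station_id'].tolist()
print(promo_stations)
```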
Distribution of start stations rental traffic categorized by fare type:
Obtain the rentals for each start station, categorized by fare type:
# create a dataframe with start stations rentals over fare type
start_stations = bikeshare.groupby([bikeshare['start_station_id'],
bikeshare['fare_type']]).size().reset_index(name='rentals')
start_stations.head()
Categorize the rental traffic values into categorical sections:
# use pd.cut to assign each rental count to its bin
bin_edges = [0,10,100,1000,10000,100000]  # renamed from `bin` to avoid shadowing the Python builtin
category = pd.cut(start_stations['rentals'], bin_edges)
category = category.to_frame()
category.columns = ['rental_bins']
category['fare_type'] = start_stations['fare_type']
category['start_station_id'] = start_stations['start_station_id']
category = category.reindex(columns=['start_station_id', 'fare_type', 'rental_bins'])
category.head()
# obtain the unique categorical rental bins
category.rental_bins.sort_values(ascending=True).unique()
Label the rental bins:
%%time
def label_race(df):
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['rental_bins'] == df.rental_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
label_race(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
Prepare a dataframe to categorize start stations over rental traffic and fare_type.
# prepare a dataframe to categorize start stations over rental traffic and fare_type
temp_df = category.groupby([category['traffic'],
category['fare_type']]).count()['start_station_id'].reset_index(name='start_stations')
temp_df.head(10)
Data Dashboard:
Plot the distribution of Start station traffic based on fare type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
sb.set_palette('GnBu', n_colors = 5, desat = 0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'fare_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ['#fcd605', '#fae887']
sb.set_palette(flatui, desat = 0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'fare_type', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'fare_type', y = 'start_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of Start stations rental traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of Start stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('fare_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Rental Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))
sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
flatui = ['#fcd605', '#fae887']
sb.set_palette(flatui, desat = 0.6)
# Seaborn's pointplot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'start_stations', hue = 'fare_type');
# improve plot aesthetics
plt.title('Classification of Start stations by rental traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Rental traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'fare_type', 'end_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.12.6 Start stations rental traffic categorized by fare type.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that most start stations experience Low rental traffic for Extended fare types. Action should be taken to encourage customers to take longer trips from these stations to increase income generation.
- Most start stations experience Normal and High rental traffic for the Base fare. This is a good sign for a healthy business model, with a higher average bike rental traffic over time.
Reform:
- Promotions should be announced to encourage customers to take longer trips from the stations with Low bike rental traffic for Extended Fares, to increase income generation.
Observations:
- [...] Round Trips experience Low and Very Low rental traffic. This reveals the need to improve rental traffic at the start stations for Round Trips.
- [...] High and Very High bike rental traffic for Round Trips is less than that of One Way trips. This denotes that One Way trips are more popular among the customers.
- Across bike types, very few start stations experience Very Low bike rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small. This keeps the start stations from serving their full potential.
- The numbers of start stations with Low and Very Low bike rental activity are clustered closely across bike types. This suggests that bike type has little influence on rental activity at these stations and that other factors might have caused the decrease in bike rental activity at these specific stations. So no action specific to bike type is required to increase the bike rental activity at start stations with Low and Very Low rental activity.
- Smart bikes have fewer start stations with Normal and High rental traffic than other bike types. This suggests that Smart bikes require more advertisement and awareness among customers.
- Most start stations experience either Low or Very Low rental traffic from Annual pass holders. As the Annual pass is a long-term subscription, this behaviour is to be expected.
- All start stations fall into the Low and Very Low rental traffic categories for the Flex pass. This is because the Flex pass was originally issued to employees for testing purposes. Hence this insight is ignored.
- A fair number of start stations experience Low rental traffic from Monthly pass holders. As the Monthly pass is mostly preferred by working individuals, promotions are unlikely to persuade them to switch their start station, since doing so might delay their ride to work. Hence no action is required to increase Monthly pass rental activity at start stations with Low rental activity.
- Many start stations experience relatively Low bike rental activity from One Day pass holders. This might be due to the influence of their geographical location or to the acquisition of bike rentals by other customer pass types.
- Most start stations experience Low rental traffic for Extended fare types. Action should be taken to encourage customers to take longer trips from these stations to increase income generation.
- Most start stations experience Normal and High rental traffic for the Base fare. This is a good sign for a healthy business model, with a higher average bike rental traffic over time.
Reform:
- Discounts or promotions should be announced for Round Trips to encourage customers to rent bikes from the stations with Low and Very Low bike rental activity for Round Trips.
- Announce promotions on Smart bikes to increase their rental activity at start stations.
- Promotions should be announced to increase One Day pass rental traffic at start stations with Low bike rental activity.
- Promotions should be announced to encourage customers to take longer trips from the stations with Low bike rental traffic for Extended Fares, to increase income generation.
Column: end_station_id
Data type: categorical, ordinal
Plot: Distribution plot, pie chart, bar chart
As bike rentals and bike returns follow a linear relation, this is to be taken into account throughout the end stations analysis.
Logarithmic distribution of end_stations bike returns:
Calculate the bike returns for each end station.
# find the bike returns based on end_station_id
end_stations = bikeshare.groupby([bikeshare['end_station_id']]).size().reset_index(name='returns')
end_stations.head()
Explore the Logarithmic distribution of end stations bike returns:
def log_trans(x, inverse = False):
    if not inverse:
        return np.log10(x)
    else:
        return 10 ** x
sb.set_style('white')
# prepare the data for the plot
min_value = log_trans(end_stations['returns'].min())
max_value = log_trans(end_stations['returns'].max())
bin_edges = np.arange(0, max_value+0.1, 0.1)
# matplotlib's histogram
plt.hist(end_stations['returns'].apply(log_trans), bins = bin_edges, color = 'salmon')
# improve plot aesthetics
plt.title('Logarithmic distribution of end stations bike returns\n', fontsize = 14, weight = 'bold')
plt.xlabel('\nNumber of bike returns', fontsize = 12)
plt.ylabel('Number of End stations\n', fontsize = 12);
tick_locs = np.arange(0, max_value+1, 1)
plt.xticks(tick_locs, log_trans(tick_locs, inverse = True).astype(int), fontsize = 10)
plt.yticks(fontsize = 10)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.1 Logarithmic distribution of end stations bike returns.png', dpi=300, bbox_inches='tight')
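The binning behind the histogram above can be checked without plotting. The following is a minimal sketch on made-up return counts (the `returns` values are hypothetical, not taken from the bikeshare data); it builds the same 0.1-decade bin edges and the whole-decade tick labels, with rounding added before the integer cast as a precaution.

```python
import numpy as np

# Hypothetical per-station return counts (not from the bikeshare data)
returns = np.array([3, 8, 45, 120, 950, 4200, 15000])

# Work in log10 space, as the histogram above does
log_returns = np.log10(returns)

# Bin edges every 0.1 decade, from 0 up to just past the largest value
bin_edges = np.arange(0, log_returns.max() + 0.1, 0.1)
counts, _ = np.histogram(log_returns, bins=bin_edges)

# Tick positions at whole decades, labelled in the original units;
# rounding guards against floating point error before the int cast
tick_locs = np.arange(0, np.ceil(log_returns.max()) + 1, 1)
tick_labels = np.round(10 ** tick_locs).astype(int)

print(counts.sum(), tick_labels)
```

Every station lands in exactly one bin, so the bin counts add up to the number of stations.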
Break down the bike return traffic at the end stations based on the above plot.
Classification of end_stations based on their bike return traffic:
Create a dataframe based on bike return traffic and the number of end stations associated with each level.
returns = {'return_traffic' : pd.Series(['Very Low', 'Low', 'Normal', 'High', 'Very High']),
'end_stations' : pd.Series([end_stations.query(' returns < 10 ').shape[0],
end_stations.query(' returns >= 10 and returns < 100 ').shape[0],
end_stations.query(' returns >= 100 and returns < 1000 ').shape[0],
end_stations.query(' returns >= 1000 and returns < 10000 ').shape[0],
end_stations.query(' returns >= 10000 ').shape[0]])}
# create Dataframe.
bike_returns = pd.DataFrame(returns)
bike_returns
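The five `query` calls above can also be expressed with a single `pd.cut`. This is only a sketch on a hypothetical stand-in for `end_stations`; it assumes left-closed bins (`right=False`) so that the boundaries match the `returns < 10`, `returns >= 10 and returns < 100`, ... conditions used above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the end_stations frame built above
end_stations = pd.DataFrame({
    'end_station_id': [3000, 3005, 3006, 3007, 3008, 3010],
    'returns': [4, 10, 99, 100, 2500, 12000],
})

labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
edges = [0, 10, 100, 1000, 10000, np.inf]

# right=False gives left-closed bins [0, 10), [10, 100), ...
traffic = pd.cut(end_stations['returns'], bins=edges, labels=labels, right=False)

# count stations per level, keeping the level order
bike_returns = (traffic.value_counts()
                       .reindex(labels)
                       .rename_axis('return_traffic')
                       .reset_index(name='end_stations'))
print(bike_returns)
```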
Plot the distribution of end stations bike return traffic.
def absolute_value(val):
    '''returns the absolute count of end stations for a given percentage,
    for use as pie chart annotations through the autopct argument'''
    a = np.round(val/100.*type_level_counts.sum(), 0)
    return int(a)
# Assign grid and figure size
plt.figure(figsize = [12, 5])
sb.set_style('white')
# left plot: Pie chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)
# prepare the data for the plot
type_level_counts = bike_returns.end_stations.values
type_level_index = bike_returns.return_traffic.values
explode = (0.2, 0, 0, 0, 0)
colors = ['bisque', 'salmon', 'salmon', 'salmon', 'salmon']
# matplotlib's pie chart
plt.pie(type_level_counts, labels = type_level_index, startangle = 90,
counterclock = False, wedgeprops = {'width' : 0.4}, shadow=False,
explode=explode, colors=colors, textprops={'fontsize': 12},
autopct='%1.0f%%', labeldistance=1.1, pctdistance=0.8)
plt.title('Percent of Stations\n\n', fontsize = 14, weight = 'bold', color = 'grey')
plt.axis('square');
# =====================================================
# /////////////////////////////////////////////////////
# right plot: Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)
# Assign grid and color palette as per requirement
base_color = sb.color_palette()[0]
# prepare the data for the plot
counts = bike_returns.end_stations.values
order = bike_returns.end_stations.index
y_locs = [0, 1, 2, 3, 4]
y_labels = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
clrs = [ 'salmon' if (count > bike_returns.end_stations.values.min()) else 'bisque' for count in counts ]
# Seaborn's bar chart
sb.barplot(x = counts, y = order, palette=clrs, alpha= 1, saturation = 0.8, orient = 'h')
# improve plot aesthetics
plt.title('Number of Stations\n\n', weight = 'bold', fontsize = 16, color = 'grey')
plt.yticks(y_locs, y_labels, rotation = 0, fontsize = 12)
plt.xticks([], [], rotation = 0, fontsize = 12)
plt.xlabel('', fontsize = 14)
# plt.ylabel('Number of Stations', fontsize = 14)
# add annotations
# -------------------------------------------------------
# loop through each pair of locations and labels
for loc, count in zip(y_locs, counts):
    pct_string = '{:0.0f}'.format(count)
    # print the annotation based on bar length
    if count <= int(max(counts)/10):
        plt.text(count+int(max(counts)/20), loc, pct_string, ha = 'center', color = 'black', fontsize = 13)
    else:
        plt.text(count-int(max(counts)/10), loc, pct_string, ha = 'center', color = 'white', fontsize = 13)
# -------------------------------------------------------
sb.despine(fig=None, ax=None, top=True, right=True, left=True, bottom=True, offset=None, trim=False);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Classification of End stations based on Bike Return Traffic\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.2 Classification of End stations based on Return Traffic.png', dpi=300, bbox_inches='tight')
The above plot shows that some end stations have very low bike return activity. This denotes that the bike return traffic is not equally distributed over the end stations.
However, this does not imply that these end stations should be eliminated, as they might still see good bike rental traffic and deliver acceptable business metrics.
Deeper exploration of end stations return behaviour for hidden insights:
Explore the distribution of end stations return traffic categorized by trip type:
Obtain the bike returns for each end station, categorized by trip type:
# create a dataframe with end stations returns over trip type
end_stations = bikeshare.groupby([bikeshare['end_station_id'],
bikeshare['trip_type']]).size().reset_index(name='returns')
end_stations.head()
Categorize the return traffic values into bins:
# use pd.cut to assign each value to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'],bins)
category = category.to_frame()
category.columns = ['return_bins']
category['trip_type'] = end_stations['trip_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'trip_type', 'return_bins'])
category.head()
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Label the return bins:
%%time
def assign_traffic(df):
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
assign_traffic(df)
# convert the 'traffic' column to ordered categorical datatype
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
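The labelling step above picks bins by their position in the sorted list of observed bins, so the labels would shift if any of the five intervals happened to be empty. A possible alternative, sketched here on hypothetical data, is to pass the labels straight to `pd.cut`, which ties each label to its interval and returns an ordered categorical in one step.

```python
import pandas as pd

# Hypothetical returns per (end station, trip type);
# note that no value falls in the (10, 100] interval
end_stations = pd.DataFrame({
    'end_station_id': [3000, 3000, 3005, 3005],
    'trip_type': ['One Way', 'Round Trip', 'One Way', 'Round Trip'],
    'returns': [5, 250, 4200, 18000],
})

bins = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

category = end_stations[['end_station_id', 'trip_type']].copy()
# each label is bound to its interval, whether or not the interval is observed
category['traffic'] = pd.cut(end_stations['returns'], bins=bins, labels=level_order)
print(category)
```

In this sketch the value 250 is labelled Normal even though no station falls in the Low interval.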
Prepare a dataframe to categorize end stations over bike return traffic and trip_type:
# prepare a dataframe to categorize end stations over return traffic and trip_type
temp_df = category.groupby([category['traffic'], category['trip_type']]).size().reset_index(name='end_stations')
temp_df
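The same counts can be viewed as a two-way table, which is sometimes easier to read than the long format above. A minimal sketch with `pd.crosstab`, using a small hypothetical `category` frame in place of the real one:

```python
import pandas as pd

level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# Hypothetical stand-in for the category frame (one row per station and trip type)
category = pd.DataFrame({
    'trip_type': ['One Way', 'One Way', 'Round Trip', 'One Way', 'Round Trip'],
    'traffic': pd.Categorical(['Low', 'Low', 'Low', 'High', 'Very Low'],
                              categories=level_order, ordered=True),
})

# rows: traffic level, columns: trip type, cells: number of end stations
table = pd.crosstab(category['traffic'], category['trip_type'])
print(table)
```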
Data Dashboard:
Plot the distribution of End station return traffic based on trip type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'trip_type', hue = 'traffic', alpha = 0.8, saturation = 0.9)
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
separators = [0.5, 1.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ['salmon', 'bisque']
sb.set_palette(flatui, desat = 0.8)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'trip_type', alpha = 0.8, saturation = 0.9)
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nReturn traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.8), loc = 6, labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
# flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
# sb.set_palette(flatui, n_colors=5, desat=0.6)
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'trip_type', y = 'end_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nTrip Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('trip_type', 'traffic', 'end_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))
sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign color palette
flatui = ['salmon', 'bisque']
sb.set_palette(flatui, desat = 0.8)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'trip_type');
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & trip type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'trip_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.3 End stations return traffic categorized by trip type.png', dpi=300, bbox_inches='tight')
Observations:
- Many end stations subjected to One Way trips experience Low bike return traffic. When Round Trips are involved, the bike rental and the bike return take place at one and the same station. Hence, when a station experiences Low bike return traffic for Round Trips, it also implies Low bike rental traffic at the same station. However, unlike Round Trips, Low bike returns subjected to One Way trips will create a deficit in the availability of bikes at that station. Hence immediate action is required to redirect bikes from stations with High and Very High bike return traffic to stations with Low and Very Low bike return traffic subjected to One Way trips, to normalize the availability of bikes over all stations.
- Also, a large number of end stations subjected to Round Trips experience Low bike return traffic.
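One way to locate the stations where such a deficit builds up is to compare rentals and returns per station. The sketch below uses hypothetical trips and assumes that both `start_station_id` and `end_station_id` are available in the same frame; the `flow` frame and its `net` column are illustrative names, not part of the original analysis.

```python
import pandas as pd

# Hypothetical trips; column names follow the station table of the dataset
bikeshare = pd.DataFrame({
    'start_station_id': [3000, 3000, 3005, 3005, 3005, 3006],
    'end_station_id':   [3005, 3006, 3005, 3000, 3006, 3006],
})

rentals = bikeshare.groupby('start_station_id').size()
returns = bikeshare.groupby('end_station_id').size()

# align both counts on the station id; a station missing on one side counts as 0
flow = pd.DataFrame({'rentals': rentals, 'returns': returns}).fillna(0).astype(int)

# negative net flow: more bikes leave the station than arrive (deficit)
flow['net'] = flow['returns'] - flow['rentals']
print(flow.sort_values('net'))
```

Stations with the most negative `net` value would be the first candidates for receiving redirected bikes.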
Reform:
- Promotions should be announced to encourage customers to opt for Round Trips at the end stations with Low bike return traffic.
Explore the distribution of end stations return traffic categorized by bike type:
Obtain the bike returns for each end station, categorized by bike type:
# create a dataframe with end stations returns over bike type
end_stations = bikeshare.groupby([bikeshare['end_station_id'],
bikeshare['bike_type']]).size().reset_index(name='returns')
end_stations.head()
Categorize the return traffic values into bins:
# use pd.cut to assign each value to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'],bins)
category = category.to_frame()
category.columns = ['return_bins']
category['bike_type'] = end_stations['bike_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'bike_type', 'return_bins'])
category.head()
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Label the return bins:
%%time
def assign_traffic(df):
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
assign_traffic(df)
# convert 'traffic' column to ordered categorical datatype
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
Prepare a dataframe to categorize end stations over return traffic and bike_type:
# prepare a dataframe to categorize end stations over return traffic and bike_type
temp_df = category.groupby([category['traffic'], category['bike_type']]).size().reset_index(name='end_stations')
temp_df
Data Dashboard:
plot the distribution of End station traffic based on bike type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'bike_type', hue = 'traffic', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of End stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
# -------------------------------------------------------
separators = [0.5, 1.5, 2.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'bike_type', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nReturn traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points')
# -------------------------------------------------------
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'bike_type', y = 'end_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.8, 1.15))
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
sb.set_palette('deep', n_colors=4, desat=0.6)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'bike_type');
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & bike type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.4 End stations bike return traffic categorized by bike type.png', dpi=300, bbox_inches='tight')
Observation:
The above plot depicts that a number of end stations experience Low and Very Low bike return activity and are clustered closely together. This suggests that the bike type has no influence on the return activity at these stations, because bike returns are correlated with bike rentals. So no action subjected to bike type is required to increase the bike return activity at the end stations with Low and Very Low bike return activity.
Reform:
Smart bikes have fewer end stations with Normal and High bike return traffic than the other bike types. This is because Smart bikes have fewer start stations with Normal and High bike rental traffic than the other bike types. However, some stations might incur fewer bike returns than bike rentals for Smart bikes, which will result in a Smart bike deficiency at these particular stations. Hence the Smart bike returns should be adjusted/redirected to match the bike demand at the specific end stations.
Explore the distribution of end stations return traffic categorized by pass type:
Obtain the bike returns for each end station, categorized by pass type:
# create a dataframe with end stations returns over pass type
end_stations = bikeshare.groupby([bikeshare['end_station_id'],
bikeshare['pass_type']]).size().reset_index(name='returns')
end_stations.head()
Categorize the return traffic values into bins:
# use pd.cut to assign each value to its bin
bins = [0,10,100,1000,10000,100000]
category = pd.cut(end_stations['returns'],bins)
category = category.to_frame()
category.columns = ['return_bins']
category['pass_type'] = end_stations['pass_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'pass_type', 'return_bins'])
category.head()
# obtain the unique categorical rental bins
category.return_bins.sort_values(ascending=True).unique()
Label the return bins:
%%time
def assign_traffic(df):
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
assign_traffic(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
Prepare a dataframe to categorize end stations over return traffic and pass_type.
# prepare a dataframe to categorize end stations over return traffic and pass_type
temp_df = category.groupby([category['traffic'],
category['pass_type']]).size().reset_index(name='end_stations')
temp_df.head(10)
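The group, bin and label steps are repeated unchanged for trip type, bike type and pass type, so they could be wrapped in two small helpers. This is only a sketch: `station_traffic`, `traffic_summary`, `BINS` and `LEVELS` are hypothetical names, and the trips below are made up.

```python
import pandas as pd

LEVELS = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
BINS = [0, 10, 100, 1000, 10000, 100000]

def station_traffic(trips, station_col, by, count_name='returns'):
    '''count trips per (station, by) pair and label each count with a traffic level'''
    counts = trips.groupby([station_col, by]).size().reset_index(name=count_name)
    counts['traffic'] = pd.cut(counts[count_name], bins=BINS, labels=LEVELS)
    return counts

def traffic_summary(category, by, name='end_stations'):
    '''count stations per (traffic level, by) pair, skipping unobserved levels'''
    return (category.groupby(['traffic', by], observed=True)
                    .size().reset_index(name=name))

# Hypothetical trips: 12 Monthly and 2 One Day returns at station 3000,
# 3 Monthly returns at station 3005
trips = pd.DataFrame({
    'end_station_id': [3000] * 14 + [3005] * 3,
    'pass_type': ['Monthly'] * 12 + ['One Day'] * 2 + ['Monthly'] * 3,
})

category = station_traffic(trips, 'end_station_id', 'pass_type')
summary = traffic_summary(category, 'pass_type')
print(summary)
```

The same two calls would serve the trip type and bike type sections by changing only the `by` argument.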
Data Dashboard:
Plot the distribution of End station traffic based on pass type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plot based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            # indexing past the end of an array or list raises IndexError, not KeyError
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = df, x = 'pass_type', hue = 'traffic', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of End stations\n\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.9, 1.15), loc = 'upper left', labelspacing=0.5,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ['indianred', 'lightcoral', 'darksalmon', 'salmon', 'lightsalmon']
sb.set_palette('deep', n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'pass_type', alpha = 0.8, saturation = 1)
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(top=True, left=True, right=True, bottom=False)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
separators = [0.5, 1.5, 2.5, 3.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'pass_type', y = 'end_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of End stations rental traffic over pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nPass Type', fontsize = 14)
plt.ylabel('Number of end stations\n', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax1.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('pass_type', 'traffic', 'start_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.9, 1.15))
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
flatui = ['indianred', 'lightcoral', 'darksalmon', 'salmon', 'lightsalmon']
sb.set_palette('deep', n_colors=5, desat=0.6)
# Seaborn's point plot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'pass_type');
# improve plot aesthetics
plt.title('Classification of End stations by rental traffic & pass type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [0, 0, 0, 0, 0]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'pass_type', 'start_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.5 End stations bike return traffic categorized by pass type.png', dpi=300, bbox_inches='tight')
Observations:
- Bike returns under the Annual pass type involve a high number of end stations with Low and Very Low return traffic. This is because the Annual pass also has a high number of stations with Low and Very Low rental activity.
- Many end stations show relatively Low bike rental activity under the One Day pass.
Reform:
- Since many end stations show relatively Low bike rental activity under the One Day pass, bike returns from other stations should be redirected to match the demand for bikes at these end stations.
Distribution of End stations bike return traffic categorized by fare type:
Obtain the returns at each end station, categorized by fare type:
# create a dataframe with end station returns over fare type
end_stations = bikeshare.groupby([bikeshare['end_station_id'],
bikeshare['fare_type']]).size().reset_index(name='returns')
end_stations.head()
Categorize the return traffic values into categorical sections:
# use pd.cut to assign each return count to its bin
# (named `bins` to avoid shadowing Python's built-in bin())
bins = [0, 10, 100, 1000, 10000, 100000]
category = pd.cut(end_stations['returns'], bins)
category = category.to_frame()
category.columns = ['return_bins']
category['fare_type'] = end_stations['fare_type']
category['end_station_id'] = end_stations['end_station_id']
category = category.reindex(columns=['end_station_id', 'fare_type', 'return_bins'])
category.head()
# obtain the unique categorical return bins
category.return_bins.sort_values(ascending=True).unique()
Label the return bins:
%%time
def label_race(df):
    # map each return bin, from lowest to highest, to a traffic label
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[0],'traffic'] = 'Very Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[1],'traffic'] = 'Low'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[2],'traffic'] = 'Normal'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[3],'traffic'] = 'High'
    df.loc[df['return_bins'] == df.return_bins.sort_values(ascending=True).unique()[4],'traffic'] = 'Very High'
df = category
label_race(df)
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
df['traffic'] = df['traffic'].astype(ordered_cat)
category.traffic.value_counts()
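The binning and labelling steps above can also be expressed in a single call, since `pd.cut` accepts a `labels` argument. The following is a minimal sketch on made-up return counts (the `returns` values are an assumption for illustration, not data from the bikeshare set); it is one possible shortcut, not the approach used in this notebook.

```python
import pandas as pd

# Toy return counts per (station, fare type) pair; values are invented for illustration.
returns = pd.Series([3, 45, 720, 8100, 56000, 10])

bin_edges = [0, 10, 100, 1000, 10000, 100000]
level_order = ['Very Low', 'Low', 'Normal', 'High', 'Very High']

# pd.cut places each value in a right-closed bin (a, b] and, when labels are
# given, returns an ordered categorical that uses those labels.
traffic = pd.cut(returns, bins=bin_edges, labels=level_order)

print(traffic.tolist())
print(traffic.cat.ordered)
```

Because the result is already an ordered categorical, a separate `CategoricalDtype` conversion would not be needed in this variant.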
Prepare a dataframe to categorize end stations over return traffic and fare_type.
# prepare a dataframe to categorize end stations over return traffic and fare_type
temp_df = category.groupby([category['traffic'],
category['fare_type']]).count()['end_station_id'].reset_index(name='end_stations')
temp_df.head(10)
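As a quick cross-check of the grouped counts, the same station counts can be viewed as a two-way table with `pd.crosstab`. This sketch uses a small invented frame (`toy`) standing in for `category`; the column names follow the notebook, the rows do not come from the dataset.

```python
import pandas as pd

# Toy stand-in for the `category` frame; the rows are invented.
toy = pd.DataFrame({
    'end_station_id': [1, 1, 2, 2, 3],
    'fare_type': ['Base', 'Extended', 'Base', 'Extended', 'Base'],
    'traffic': ['High', 'Low', 'High', 'Low', 'Low'],
})

# Rows: traffic level, columns: fare type,
# cells: number of (station, fare type) rows in that combination.
table = pd.crosstab(toy['traffic'], toy['fare_type'])
print(table)
```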
Data Dashboard:
Plot the distribution of End station traffic based on fare type:
def add_annotations(category_var, hue_var, counts_var, x_annotations, y_annotations, alignments):
    '''add custom annotations to the plots based on hue and category'''
    labels = temp_df[category_var].sort_values(ascending=True).unique()
    hues = temp_df[hue_var].sort_values(ascending=True).unique()
    for loc, var in enumerate(hues):
        cat_counts = temp_df[temp_df[hue_var] == var][counts_var].values
        for i, label in enumerate(labels):
            try:
                pct_string = '{:0.0f}'.format(cat_counts[i])
                plt.text(i + x_annotations[i], cat_counts[i] + y_annotations[i], pct_string,
                         ha = alignments[i], color = 'black', fontsize = 13,
                         bbox=dict(pad=1.9,alpha=0.2,color='none',fc=sb.color_palette()[loc]))
            except IndexError:
                continue
# Assign grid and figure size
fig, axes = plt.subplots(figsize=(16,11), nrows=2, ncols=2) # grid of 2x2 subplots
axes = axes.flatten() # reshape from 2x2 array into 4-element vector
sb.set_style('white')
# Plot 0 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[0])
# Assign palette as per requirement
flatui = ['bisque', 'lightsalmon', 'darksalmon', 'salmon', 'tomato']
sb.set_palette(flatui, n_colors=5, desat=0.6)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'fare_type', hue = 'traffic', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.85, 0.8), loc = 6, labelspacing=0.5,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
separators = [0.5]
for loc in separators:
    plt.axvline(loc, ls='--', color='grey', linewidth=1, alpha=0.4);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 1 : Clustered Bar chart
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[1])
# Assign palette as per requirement
flatui = ['#cc2e88', '#fa98d0']
sb.set_palette(flatui, desat = 0.8)
# plot clustered bar chart
g = sb.countplot(data = category, x = 'traffic', hue = 'fare_type', alpha = 0.8, saturation = 0.8)
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
sb.despine(left=True)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
for p in g.patches:
    g.annotate(format(p.get_height(), '.0f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# =====================================================
# /////////////////////////////////////////////////////
# Plot 2 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[2])
# Assign color palette
flatui = ['#d62d6e', '#d6b42d', '#2dd6ce', '#2da6d6', '#395091']
sb.set_palette(flatui, n_colors=5, desat=0.8)
# Seaborn's point plot
ax1 = sb.pointplot(data = temp_df, x = 'fare_type', y = 'end_stations', hue = 'traffic');
# improve plot aesthetics
plt.title('Distribution of End stations return traffic over fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nFare Type', fontsize = 14)
plt.ylabel('Number of End stations\n', fontsize = 14)
plt.yticks([], fontsize = 12)
plt.xticks(fontsize = 12)
# add annotations
x_annotations = [-0.05, 0.05]
y_annotations = [0, 0]
alignments = ['right', 'left']
add_annotations('fare_type', 'traffic', 'end_stations', x_annotations, y_annotations, alignments)
# add legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Return Traffic', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(0.85, 1))
sb.despine(top=True, bottom=True, left=True, right=True, ax = ax1);
# =====================================================
# /////////////////////////////////////////////////////
# Plot 3 : Point plot
# =====================================================
# /////////////////////////////////////////////////////
plt.sca(axes[3])
# Assign palette as per requirement
flatui = ['#cc2e88', '#fa98d0']
sb.set_palette(flatui, desat = 0.8)
# Seaborn's pointplot
ax2 = sb.pointplot(data = temp_df, x = 'traffic', y = 'end_stations', hue = 'fare_type');
# improve plot aesthetics
plt.title('Classification of End stations by return traffic & fare type\n\n', fontsize = 16, weight = 'bold')
plt.xlabel('\nBike Return traffic', fontsize = 14)
plt.ylabel('', fontsize = 14)
plt.yticks(fontsize = 12)
plt.xticks(fontsize = 12)
# modify characteristics of each line
for i, line in enumerate(ax2.lines):
    line.set_markevery(every=None)
    line.set_marker('o')
    line.set_markersize(8)
    line.set_markeredgewidth(2)
    line.set_markerfacecolor('#ffffff')
    try:
        base_color = sb.color_palette()[i]
        line.set_markeredgecolor(base_color)
    except IndexError:
        continue
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (0.8, 0.9), loc = 6, labelspacing=0.5,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
x_annotations = [0, 0, 0, 0, 0]
y_annotations = [6, 6, 6, 6, 6]
alignments = ['center', 'center', 'center', 'center', 'center']
# add_annotations('traffic', 'fare_type', 'end_stations', x_annotations, y_annotations, alignments)
sb.despine(top=True, bottom=False, left=False, right=True, ax = ax2);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.7)
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.13.6 End stations return traffic categorized by fare type.png', dpi=300, bbox_inches='tight')
Observations:
- Bike returns under Extended Fares involve a high number of end stations with Low return traffic and fewer stations with High return traffic. This suggests that Extended Fares are less desired by customers.
- Bike returns under Base Fares involve a high number of end stations with High return traffic and fewer stations with Low return traffic. This suggests that Base Fares are preferred by customers.
Reform:
- Customers should be encouraged to ride the bikes for longer durations and incur Extended Fares, generating more income for the company.
- Very Low bike return activity. However, this does not imply that these end stations should be eliminated, as they might still see good bike rental traffic and deliver acceptable business metrics.
- One Way trips experience Low bike return traffic. When Round Trips are involved, the bike rental and the bike return take place at the same station. Hence, when a station experiences Low bike return traffic, it also implies Low bike rental traffic at that station. However, unlike Round Trips, Low bike returns from One Way trips create a deficit in the availability of bikes at the specific station. Hence immediate action is required to redirect bikes from stations with High and Very High bike return traffic to stations with Low and Very Low bike return traffic from One Way trips, to normalize the availability of bikes across all stations.
- Round Trips experience Low bike return traffic.
- Low and Very Low bike return activity and are clustered closely together. This indicates that the bike type has no influence on the return activity at these stations, because bike returns correlate with bike rentals. So no action related to bike type is required to increase the bike return activity at the end stations with Low and Very Low bike return activity.
- Bike returns under the Annual pass type involve a high number of end stations with Low and Very Low return traffic. This is because the Annual pass also has a high number of stations with Low and Very Low rental activity.
- Many end stations show relatively Low bike rental activity under the One Day pass.
- Bike returns under Extended Fares involve a high number of end stations with Low return traffic and fewer stations with High return traffic. This suggests that Extended Fares are less desired by customers.
- Bike returns under Base Fares involve a high number of end stations with High return traffic and fewer stations with Low return traffic. This suggests that Base Fares are preferred by customers.
- Round Trips at the end stations with Low bike return traffic.
- Smart bike has fewer end stations with Normal and High bike return traffic compared to other bike types. This is because Smart bikes have fewer start stations with Normal and High bike rental traffic compared to other bike types. However, some stations might see fewer bike returns than bike rentals for Smart bikes, which will result in a Smart bike deficiency at these particular stations. Hence the Smart bike returns should be adjusted/redirected to match the bike demand at the specific end stations.
- Since many end stations show relatively Low bike rental activity under the One Day pass, bike returns from other stations should be redirected to match the demand for bikes at these end stations.
- Customers should be encouraged to ride the bikes for longer durations and incur Extended Fares, generating more income for the company.
Column: fare_type
Data type: categorical, ordinal
Plot: Histogram, Bar chart
Display the top 5 most frequent trip durations for the extended fares:
# limit the dataset to the extended fare type
extended_df = bikeshare.query(' fare_type == "Extended" ')
# obtain the most frequent trip durations for trips with extended fares
freq_minutes = extended_df.duration_min.value_counts().head(5).index
print('Most frequent extended trip durations:')
print('-'*38)
for i, minute in enumerate(freq_minutes):
    print('{}. {} minutes'.format(i+1, minute))
The most frequent extended-fare trip durations appear to fall within 5 minutes of the Base Fare threshold. Plot the distribution of extended rides with a 5 minute grace period for further analysis. To limit the effect of outliers, restrict the dataset to trip durations under 120 minutes.
# calculate the percentage of the dataset that falls under `2 hour` trip duration.
df_percent = np.round((bikeshare.query(' duration_min <= 120 ').shape[0]/bikeshare.shape[0])*100, 2)
print('The percentage of the dataset that falls under 2 hour trip duration: {} %'.format(df_percent))
Plot the distribution of grace period rides over other extended rides:
# Assign color palette and grid as per requirement
sb.set_style('white')
flatui = ["gold"]
sb.set_palette(palette = flatui, n_colors = 10, desat = 0.7)
# prepare the data for the plot
# -------------------------------------------------------
# limit the dataset to extended rides of up to 2 hours (30 < duration <= 120 minutes)
duration_lim_120 = bikeshare.query(' duration_min <= 120 and duration_min > 30')
# limit the dataset to rides inside the 5 minute grace window (30 < duration <= 35 minutes)
duration_lim_35 = bikeshare.query(' duration_min <= 35 and duration_min > 30')
base_color = sb.color_palette()[0]
bin_edges = np.arange(30, duration_lim_120.duration_min.max()+1, 1)
x_locs = np.arange(30, 120+10, 10)
# -------------------------------------------------------
ax1 = plt.hist(duration_lim_120['duration_min'], color = base_color, bins = bin_edges)
ax2 = plt.hist(duration_lim_35['duration_min'], color = 'c', bins = bin_edges)
# improve plot aesthetics
plt.title('Distribution of extended trip durations under 2 hours\n', weight = 'bold', fontsize = 16)
plt.xlabel('\nDuration (minutes)', fontsize = 14)
plt.ylabel('Rentals (thousands)\n', fontsize = 14)
plt.xticks(x_locs, x_locs, fontsize = 12)
# convert the y_ticks into units of thousands
locs, labels = plt.yticks()
y_tick_locs = np.arange(0, int(math.ceil(max(locs)))+1000, 1000)
y_tick_names = ['{:0.0f} K'.format(loc/1000) for loc in y_tick_locs]
plt.yticks(y_tick_locs, y_tick_names, fontsize=12)
# add custom legend
# -------------------------------------------------------
custom = [Line2D([], [], color = 'c', linestyle='-', linewidth = 2),
Line2D([], [], color = base_color, linestyle='-', linewidth = 2)]
plt.legend(custom, ['Grace', 'Extended'], scatterpoints=1, frameon=True, fancybox=True,
shadow=False, framealpha = 1, borderpad=1, borderaxespad=1, labelspacing=0.5,
ncol = 1, title='Duration period', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(1.15, 1));
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.14.1 Distribution of extended trip durations under 2 hours.png', dpi=300, bbox_inches='tight')
Observation:
- Customers most frequently return the bikes just after the base fare margin. Even if this results from individual preference or behaviour, it may lead to customer dissatisfaction at paying an extra fee (the extended fare) that could have been avoided with a little caution.
- This suggests the need for a reminder that notifies customers of the expiration of the Base Fare.
Calculation of the percentage of customers eligible for a 5 minute grace period on extended fares:
# limit the dataset to the extended fare type
extended_df = bikeshare.query(' fare_type == "Extended" ')
grace_percent = math.ceil(extended_df.query(' duration_min > 30 and duration_min <= 35 ').shape[0]/extended_df.shape[0]*100)
print('Percentage of customers who are eligible for grace period: {} %'.format(grace_percent))
Plot the percentage of customer rides that are eligible for 5 minute grace period:
sb.set_palette('deep', n_colors=2, desat=0.6)
# prepare the data for the plot
extended_df = bikeshare.query('duration_min > 30')
counts = (extended_df.query(' duration_min <= 35 ')['trip_id'].count(),
extended_df.query(' duration_min > 35 ')['trip_id'].count())
order = ['Eligible', 'Not Eligible']
# seaborn's bar plot
sb.barplot(x = order, y = counts, alpha= 1, saturation = 0.8)
# improve plot aesthetics
plt.title('Customer rides eligible for 5 minute grace period\n', weight = 'bold', fontsize = 16)
plt.xlabel('\nCustomer eligibility', fontsize = 14)
plt.ylabel('Number of rides (thousands)\n', fontsize = 14)
# loop through yticks to convert them into units of thousands
locs, labels = plt.yticks()
new_labels = ['{:0.0f} k'.format(loc/1000) for loc in locs]
plt.yticks(locs, new_labels, fontsize = 12)
plt.xticks(fontsize = 12)
# add annotations
# -------------------------------------------------------
total_counts = sum(counts)
# get the current tick locations and labels
locs, labels = plt.xticks()
# loop through each pair of locations and labels
for i, loc in enumerate(locs):
    count = counts[i]
    pct_string = '{:0.0f}%'.format(100*count/total_counts)
    # print the annotation
    plt.text(loc, count + (total_counts/20), pct_string, ha = 'center', color = 'black', fontsize = 14)
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.14.2 Customer rides eligible for 5 minute grace period.png', dpi=300, bbox_inches='tight')
Observation:
- The above plot shows that 16% of the rides longer than 30 minutes are eligible for a 5 minute grace period before the Extended Fare is charged.
- Since the share of rides eligible for the grace period is far smaller than the share of remaining rides with Extended Fares, the income generated from Extended Fares would decrease only slightly. Hence deploying a grace period is a fair option with little impact on the income generated from Extended Fares.
- Customers most frequently return the bikes just after the Base Fare margin. Even if this results from individual preference or behaviour, it may lead to customer dissatisfaction at paying an extra fee (the extended fare) that could have been avoided with a little caution. This suggests the need for a reminder that notifies customers of the expiration of the Base Fare.
- The above plot shows that 16% of the rides longer than 30 minutes are eligible for a 5 minute grace period before the Extended Fare is charged.
- Base Fare to the customers will alert the customer to return the bike to the nearest bike station to avoid the Extended Fare, and will result in increased customer satisfaction.
- 5 minute grace period to the Extended Fares. Since the share of rides eligible for the 5 minute grace period is far smaller than the share of remaining rides with Extended Fares, the income generated from Extended Fares would decrease only slightly. Hence deploying a grace period is a fair option with little impact on the income generated from Extended Fares.
Column: duration_min
Data type: numerical, continuous
Plot: Bar chart, Pie plot
Categorical distribution of trip durations:
# compute the descriptive statistics of trip durations
bikeshare.duration_min.describe()
Trip durations under 1 minute probably result from the bicycle being returned immediately after rental because of a technical or other issue. Hence exclude the trips under 1 minute.
Break down the trip durations into categories and convert them into a dataframe:
durations = {'trip_type' : pd.Series(['Small', 'Normal', 'Long', 'Very Long']),
'trip_count' : pd.Series([bikeshare.query(' duration_min >= 1 and duration_min < 10 ').shape[0],
bikeshare.query(' duration_min >= 10 and duration_min < 100 ').shape[0],
bikeshare.query(' duration_min >= 100 and duration_min < 1000 ').shape[0],
bikeshare.query(' duration_min >= 1000 ').shape[0]])}
# create a Dataframe.
trip_durations = pd.DataFrame(durations)
trip_durations
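The four duration categories can also be derived with `pd.cut` instead of one query per category. The sketch below is an alternative under stated assumptions: the toy `durations` series is invented, and `right=False` is used so that each bin is left-closed, matching the `>=` / `<` conditions in the queries above.

```python
import numpy as np
import pandas as pd

# Toy trip durations in minutes; invented for illustration.
durations = pd.Series([0.5, 1, 9.9, 10, 99, 100, 999, 1000, 5000])

edges = [1, 10, 100, 1000, np.inf]
labels = ['Small', 'Normal', 'Long', 'Very Long']

# right=False gives left-closed bins: [1, 10), [10, 100), [100, 1000), [1000, inf).
# Durations under 1 minute fall outside every bin and become NaN, so they are excluded.
trip_type = pd.cut(durations, bins=edges, labels=labels, right=False)

# sort_index orders the counts by category order rather than by frequency.
trip_counts = trip_type.value_counts().sort_index()
print(trip_counts)
```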
Plot the categorical distribution of trip durations:
Bar chart:
# Assign grid and color palette as per requirement
plt.figure(figsize = [12, 4])
sb.set_style("white")
base_color = 'cadetblue'
# plot pre-calculations
duration_order = ['Very Long', 'Long', 'Normal', 'Small']
time_order = ['[1000, )', '[100, 1000)', '[10, 100)', '[1 , 10)']
trip_counts = trip_durations.trip_count
trip_order = trip_durations.trip_type
x_tick_values = np.arange(0, trip_counts.max() + 50000, 50000)
x_tick_names = ['{:0.0f} K'.format(v/1000) for v in x_tick_values]
y_tick_values = np.arange(0, len(duration_order), 1) # one tick per duration category
y_tick_names = duration_order
clrs = ['indianred', 'indianred', 'cadetblue', 'cadetblue']
# bar plot
sb.barplot(x = trip_counts, y = trip_order, order = duration_order, palette=clrs, alpha= 1, saturation = 1)
# plot - visual enhancements
plt.title('Categorical distribution of trip durations\n', weight = 'bold', fontsize = 16)
plt.xticks(x_tick_values, x_tick_names, fontsize = 12)
plt.yticks(y_tick_values, y_tick_names, fontsize = 12)
plt.xlabel('\nNumber of rides (thousands)', fontsize = 14)
plt.ylabel('Duration type\n', fontsize = 14)
# Create a custom legend:
# -------------------------------------------------------
# Plot empty lists with the desired label
indents = [10, 13, 11, 13]
for duration, time, indent in zip(duration_order, time_order, indents):
    plt.scatter([], [], c='k', alpha=0.3,
                label= '{}'.format(duration).ljust(indent, ' ') + ' - ' + '{}'.format(time))
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=True, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.5), loc = 6, labelspacing=0.5,
title='Duration - minutes', title_fontsize=14, fontsize=12, facecolor='white',
markerfirst=True, handlelength=0.5, handletextpad=0.5)
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.1 Categorical distribution of trip durations.png', dpi=300, bbox_inches='tight')
Observation:
- Most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations.
Reform:
- Even if this behaviour is to be expected, encouraging customers to ride the bikes for longer durations will generate income and improve business metrics.
- Organizing 2hr bike rallies and other events will attract enthusiasts to ride the bikes for longer durations.
- Announcing Low Fares for tourists will attract them to rent the bikes for longer durations.
Calculate the most frequent and average trip durations:
The statistics computed on the trip durations are affected by the presence of outliers. Hence a separate analysis is performed by limiting the dataset to trip durations under 30 minutes and under 120 minutes, alongside the overall trip durations.
Calculate the average trip durations of the dataset under timeline limitations:
# calculate average trip durations of the dataset under timeline limitations:
overall_mean = math.ceil(bikeshare.duration_min.mean())
duration_lim_120_mean = math.ceil(bikeshare.query(' duration_min <= 120 ').duration_min.mean())
duration_lim_30_mean = math.ceil(bikeshare.query(' duration_min <= 30 ').duration_min.mean())
print('Dataset limitation'.ljust(20, ' '), ':', 'Avg. Trip duration')
print('-'*41)
print('overall'.ljust(20, ' '), ':', overall_mean, 'minutes')
print('under 120 minutes'.ljust(20, ' '), ':', duration_lim_120_mean, 'minutes')
print('under 30 minutes'.ljust(20, ' '), ':', duration_lim_30_mean, 'minutes')
Calculate the most frequent trip durations of the dataset under timeline limitations:
# calculate most frequent trip durations of various timeline limitations
# Series.mode() returns a Series (ties are possible), so take its first value
overall_mode = math.ceil(bikeshare.duration_min.mode().iloc[0])
duration_lim_120_mode = math.ceil(bikeshare.query(' duration_min <= 120 ').duration_min.mode().iloc[0])
duration_lim_30_mode = math.ceil(bikeshare.query(' duration_min <= 30 ').duration_min.mode().iloc[0])
print('Dataset limitation'.ljust(20, ' '), ':', 'Freq. Trip duration')
print('-'*42)
print('overall'.ljust(20, ' '), ':', overall_mode, 'minutes')
print('under 120 minutes'.ljust(20, ' '), ':', duration_lim_120_mode, 'minutes')
print('under 30 minutes'.ljust(20, ' '), ':', duration_lim_30_mode, 'minutes')
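One detail worth keeping in mind when reading the most frequent duration: `Series.mode()` returns a Series, because several values can be tied for the highest frequency. A minimal sketch with invented durations shows the behaviour; picking the first entry is one possible convention, not the only one.

```python
import pandas as pd

# Toy durations in minutes where 6 and 7 are tied as most frequent.
toy = pd.Series([6, 6, 7, 7, 12])

modes = toy.mode()             # Series holding every tied mode, sorted ascending
most_frequent = modes.iloc[0]  # convention used here: take the smallest tied value

print(modes.tolist())
print(most_frequent)
```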
Convert the most frequent and average trip durations into a dataframe:
# convert the most frequent and average trip durations into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset_duration'] = ['< 30', '< 120', 'overall']
duration_df['avg_trip_duration'] = [12, 18, 30]
duration_df['freq_trip_duration'] = [6, 6, 6]
duration_df
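The values in the dataframe above are typed in by hand from the printed results. As a hedged alternative, the same table could be assembled directly from the computed statistics so that it stays in sync with the data. The sketch below runs on an invented `toy` frame with a `duration_min` column; the `limits` mapping and the `summary` name are assumptions for illustration.

```python
import math
import pandas as pd

# Toy trip durations in minutes; invented for illustration.
toy = pd.DataFrame({'duration_min': [4, 6, 6, 15, 28, 45, 110, 600]})

# Upper duration limit per row of the summary; None means no limit.
limits = {'< 30': 30, '< 120': 120, 'overall': None}

rows = []
for label, limit in limits.items():
    subset = toy if limit is None else toy[toy['duration_min'] <= limit]
    rows.append({
        'dataset_duration': label,
        # round the mean up to whole minutes, as in the cells above
        'avg_trip_duration': math.ceil(subset['duration_min'].mean()),
        # first mode, in case several durations are tied
        'freq_trip_duration': subset['duration_min'].mode().iloc[0],
    })

summary = pd.DataFrame(rows)
print(summary)
```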
Plot the most frequent and average trip durations:
plt.figure(figsize = [12, 5])
# left plot: Average trip duration
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 1)
sb.set_style('white')
sb.set_palette(palette = "GnBu", n_colors = 3, desat = None)
# Seaborn's bar chart
ax1 = sb.barplot(data = duration_df, x = 'dataset_duration', y = 'avg_trip_duration')
# improve plot aesthetics
plt.title('Avg. Trip duration\n', weight = 'bold', fontsize = 14, color = 'dimgrey')
plt.ylabel('', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# convert the yticks into integer values
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# add annotations
# -------------------------------------------------------
locs, labels = plt.xticks()
duration_avg_counts = duration_df.avg_trip_duration.values
duration_avg_max = duration_avg_counts.max()
clrs = ['gold' if (value > ((duration_avg_max*4)/5)) else 'limegreen' for value in duration_avg_counts]
# loop through each pair of locations
for loc, duration_avg_count, clr in zip(locs, duration_avg_counts, clrs):
    try:
        count = duration_avg_count
    except KeyError:
        count = 0
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation slightly above the top of the bar
    plt.text(loc, count + int(duration_avg_max/30), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, right=True, bottom=False, left=False);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: Most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
plt.subplot(1, 2, 2)
sb.set_style('white')
sb.set_palette(palette = "GnBu", n_colors = 3, desat = None)
# Seaborn's bar chart
ax2 = sb.barplot(data = duration_df, x = 'dataset_duration', y = 'freq_trip_duration')
# improve plot aesthetics
plt.title('Most freq Trip duration\n', weight = 'bold', fontsize = 14, color = 'dimgrey')
plt.ylabel('', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# add annotations
# -------------------------------------------------------
locs, labels = plt.xticks()
duration_freq_counts = duration_df.freq_trip_duration.values
duration_freq_max = duration_freq_counts.max()
clrs = ['gold' if (value > ((duration_freq_max*4)/5)) else 'limegreen' for value in duration_freq_counts]
# loop through each pair of locations
for loc, duration_freq_count, clr in zip(locs, duration_freq_counts, clrs):
    try:
        count = duration_freq_count
    except KeyError:
        count = 0
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation slightly above the top of the bar
    plt.text(loc, count + (duration_freq_max/20), pct_string, ha = 'center', color = 'black', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, right=True, bottom=False, left=False);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.2, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Trip durations of the dataset under timeline limitations\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.2 Trip durations of the dataset under timeline limitations.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that the average trip duration of the overall dataset is 30 minutes. However, when the outliers are removed and the data is re-evaluated, the average trip duration drops to 18 minutes.
- For trips under 30 minutes, the average trip duration is 12 minutes.
- The most frequent trip duration remains 6 minutes for all datasets limited by trip duration. This indicates that most customers rent the bikes for a quick, short trip.
Deeper exploration of the factors that influence the trip durations for hidden insights:
- Trips of 1 minute or less are probably due to the bike being returned immediately after the rental because of a technical or other issue. Hence trips of 1 minute or less are excluded throughout the rest of the analysis on trip durations.
- The trip durations are assessed for the datasets limited to 30 minutes and 120 minutes, along with the overall trip durations.
Calculate the most frequent and average trip durations by trip type:
Calculate the average trip duration and the most frequent trip duration for each trip type:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')
oneway_mean = math.ceil(duration_lim.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim.query(' trip_type == "One Way" ').duration_min.mode()[0]
roundtrip_mean = math.ceil(duration_lim.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim.query(' trip_type == "Round Trip" ').duration_min.mode()[0]
print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')
oneway_mean = math.ceil(duration_lim_120.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim_120.query(' trip_type == "One Way" ').duration_min.mode()[0]
roundtrip_mean = math.ceil(duration_lim_120.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim_120.query(' trip_type == "Round Trip" ').duration_min.mode()[0]
print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')
oneway_mean = math.ceil(duration_lim_30.query(' trip_type == "One Way" ').duration_min.mean())
oneway_mode = duration_lim_30.query(' trip_type == "One Way" ').duration_min.mode()[0]
roundtrip_mean = math.ceil(duration_lim_30.query(' trip_type == "Round Trip" ').duration_min.mean())
roundtrip_mode = duration_lim_30.query(' trip_type == "Round Trip" ').duration_min.mode()[0]
print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('oneway_mean : ', oneway_mean, 'minutes')
print('roundtrip_mean : ', roundtrip_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('oneway_mode : ', oneway_mode, 'minutes')
print('roundtrip_mode : ', roundtrip_mode, 'minutes')
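The three cells above repeat the same filter-and-summarise steps for each duration limit. A minimal sketch of a more compact alternative, assuming a `groupby` over the category column, is shown below. The `trips` frame is a made-up miniature of `bikeshare`, and the helper names `ceil_mean`, `first_mode` and `duration_by_trip_type` are hypothetical.

```python
import math
import numpy as np
import pandas as pd

# Hypothetical miniature of the bikeshare table (values are made up)
trips = pd.DataFrame({
    "trip_type": ["One Way"] * 6 + ["Round Trip"] * 4,
    "duration_min": [1, 5, 5, 12, 40, 300, 28, 28, 60, 150],
})

def ceil_mean(values):
    """Mean rounded up to the next whole minute."""
    return math.ceil(values.mean())

def first_mode(values):
    """First mode; Series.mode() returns tied modes in ascending order."""
    return values.mode().iloc[0]

summaries = []
for label, cap in [("< 30", 30), ("< 120", 120), ("overall", np.inf)]:
    # same filter as the cells above: drop trips of 1 minute or less, then apply the cap
    subset = trips[(trips["duration_min"] > 1) & (trips["duration_min"] <= cap)]
    stats = (subset.groupby("trip_type")
                   .agg(duration_avg=("duration_min", ceil_mean),
                        duration_mode=("duration_min", first_mode))
                   .reset_index())
    stats.insert(0, "dataset", label)
    summaries.append(stats)

duration_by_trip_type = pd.concat(summaries, ignore_index=True)
print(duration_by_trip_type)
```

The result is already in the long format that `sb.pointplot` expects, so the same loop could be reused for the other category columns by changing the `groupby` key.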
Convert the most frequent and average trip durations categorized by trip type into a dataframe:
# convert the most frequent and average trip durations categorized by trip type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 120', '< 120', 'overall', 'overall']
duration_df['trip_type'] = ['One Way', 'Round Trip', 'One Way', 'Round Trip', 'One Way', 'Round Trip']
duration_df['duration_avg'] = [12, 18, 16, 39, 24, 71]
duration_df['duration_mode'] = [5, 28, 5, 28, 5, 28]
duration_df
Plot the most frequent and average trip durations by trip type:
plt.figure(figsize = [12, 6])
flatui = ['deepskyblue', 'sandybrown']
sb.set_palette(flatui, n_colors=2, desat=0.6)
# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'trip_type')
# improve plot aesthetics
plt.title('Avg. Trip durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
avg_rental_counts = duration_df["duration_avg"]
avg_rental_types = duration_df["trip_type"]
avg_rental_max = avg_rental_counts.max()
clrs = ['gold' if (trip == "Round Trip") else 'limegreen' for trip in avg_rental_types ]
# loop through each pair of locations and assign text
for loc, avg_rental_count, clr in zip(locs, avg_rental_counts, clrs):
    try:
        count = avg_rental_count
    except KeyError:
        count = 0
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation slightly above and to the left of the point
    plt.text(loc-0.2, count + int(avg_rental_max/20), pct_string, ha = 'center', color = 'black', fontsize = 12,
             bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# -------------------------------------------------------
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'trip_type')
# improve plot aesthetics
plt.title('Most frequent durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
freq_rental_counts = duration_df["duration_mode"]
freq_rental_types = duration_df["trip_type"]
freq_rental_max = freq_rental_counts.max()
clrs = ['gold' if (trip == "Round Trip") else 'limegreen' for trip in freq_rental_types ]
# loop through each pair of locations and assign text
for loc, freq_rental_count, clr in zip(locs, freq_rental_counts, clrs):
    try:
        count = freq_rental_count
    except KeyError:
        count = 0
    pct_string = '{:0.0f} min'.format(math.ceil(count))
    # print the annotation slightly above the point
    plt.text(loc, count + int(freq_rental_max/5), pct_string, ha = 'center', color = 'black', fontsize = 12,
             bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# -------------------------------------------------------
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 2,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5, bbox_to_anchor=(-0.5, 1.5))
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.65)
plt.suptitle('Assessment of trip durations based on trip type over datasets\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.3 Assessment of trip durations based on trip type over datasets.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that outliers have a strong effect on the average trip durations. With the outliers removed, the dataset limited to 120 minutes has an average trip duration of 16 minutes for One Way trips and 39 minutes for Round Trips.
- The dataset limited to 30 minutes has an average trip duration of 12 minutes for One Way trips and 18 minutes for Round Trips.
- Per the dataframe above, customers therefore tend to ride longer on Round Trips than on One Way trips.
- The most frequent trip duration stays the same across the datasets for Round Trips, at 28 minutes. This suggests that most customers who prefer longer trips use the full extent of the Base Fare trip duration on Round Trips.
- The most frequent trip duration also stays the same across the datasets for One Way trips, at 5 minutes. This suggests that most customers prefer short trips when riding One Way.
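The plotting cell above keeps both panels comparable by copying the left panel's y-limits and ticks onto the right panel. A minimal sketch of an alternative, assuming plain matplotlib line plots rather than `sb.pointplot`, is to let `plt.subplots(..., sharey=True)` tie the two axes together. The values are copied from the trip type `duration_df` above; the names `ax_avg` and `ax_mode` are hypothetical.

```python
import matplotlib.pyplot as plt

datasets = ["< 30", "< 120", "overall"]
# values copied from the trip type duration_df above
avg = {"One Way": [12, 16, 24], "Round Trip": [18, 39, 71]}
mode = {"One Way": [5, 5, 5], "Round Trip": [28, 28, 28]}

# sharey=True ties the y-limits and ticks of both panels together
fig, (ax_avg, ax_mode) = plt.subplots(1, 2, figsize=(12, 6), sharey=True)
for trip_type in avg:
    ax_avg.plot(datasets, avg[trip_type], marker="o", label=trip_type)
    ax_mode.plot(datasets, mode[trip_type], marker="o", label=trip_type)

ax_avg.set_title("Avg. Trip durations")
ax_mode.set_title("Most frequent durations")
ax_avg.set_ylabel("Duration (minutes)")
ax_mode.legend(title="Trip type")
```

With a shared axis there is no need to call `plt.ylim(ax1.get_ylim())` or to repeat the `plt.yticks` call for the second panel.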
Calculate the most frequent and average trip durations by bike type:
Calculate the average trip duration and the most frequent trip duration for each bike type:
Note: Since no useful conclusion or insight can be drawn from the unknown bike type, it is excluded from the analysis.
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')
standard_mean = math.ceil(duration_lim.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim.query(' bike_type == "Standard" ').duration_min.mode()[0]
electric_mean = math.ceil(duration_lim.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim.query(' bike_type == "Electric" ').duration_min.mode()[0]
smart_mean = math.ceil(duration_lim.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim.query(' bike_type == "Smart" ').duration_min.mode()[0]
print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode : ', smart_mode, 'minutes')
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')
standard_mean = math.ceil(duration_lim_120.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim_120.query(' bike_type == "Standard" ').duration_min.mode()[0]
electric_mean = math.ceil(duration_lim_120.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim_120.query(' bike_type == "Electric" ').duration_min.mode()[0]
smart_mean = math.ceil(duration_lim_120.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim_120.query(' bike_type == "Smart" ').duration_min.mode()[0]
print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode : ', smart_mode, 'minutes')
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')
standard_mean = math.ceil(duration_lim_30.query(' bike_type == "Standard" ').duration_min.mean())
standard_mode = duration_lim_30.query(' bike_type == "Standard" ').duration_min.mode()[0]
electric_mean = math.ceil(duration_lim_30.query(' bike_type == "Electric" ').duration_min.mean())
electric_mode = duration_lim_30.query(' bike_type == "Electric" ').duration_min.mode()[0]
smart_mean = math.ceil(duration_lim_30.query(' bike_type == "Smart" ').duration_min.mean())
smart_mode = duration_lim_30.query(' bike_type == "Smart" ').duration_min.mode()[0]
print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('standard_mean : ', standard_mean, 'minutes')
print('electric_mean : ', electric_mean, 'minutes')
print('smart_mean : ', smart_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('standard_mode : ', standard_mode, 'minutes')
print('electric_mode : ', electric_mode, 'minutes')
print('smart_mode : ', smart_mode, 'minutes')
Convert the most frequent and average trip durations categorized by bike type into a dataframe:
# convert the most frequent and average trip durations categorized by bike type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 30',
'< 120', '< 120', '< 120',
'overall', 'overall', 'overall']
duration_df['bike_type'] = ['Standard', 'Electric', 'Smart',
'Standard', 'Electric', 'Smart',
'Standard', 'Electric', 'Smart']
duration_df['duration_avg'] = [11, 13, 16,
17, 16, 31,
31, 25, 45]
duration_df['duration_mode'] = [5, 4, 7,
5, 4, 7,
5, 4, 7]
duration_df
Plot the most frequent and average trip durations by bike type:
plt.figure(figsize = [12, 5])
flatui = ['#ff91e2', '#60acfc', '#bd91ff']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'bike_type', alpha = 1)
# improve plot aesthetics
plt.title('Avg. Trip durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'bike_type', alpha = 1)
# improve plot aesthetics
plt.title('Most frequent durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on bike type over datasets\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.4 Assessment of trip durations based on bike type over datasets.png', dpi=300, bbox_inches='tight')
Observations:
- The above plot shows that customers prefer Smart bikes over the other bike types for trips with longer durations. Standard and Electric bikes have similar average trip durations, which indicates that customers use these two bike types for trips of almost the same length.
- The gap between Smart bikes and the other bike types widens as the duration limit is relaxed. For example, Smart bikes exceed the other bike types by a larger margin in the dataset limited to 120 minutes than in the dataset limited to 30 minutes.
Calculate the most frequent and average trip durations by pass type:
Calculate the average trip duration and the most frequent trip duration for each pass type:
Note: Since the Flex pass was introduced to employees for testing purposes, the Flex pass type is excluded from the analysis.
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')
walkup_mean = math.ceil(duration_lim.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim.query(' pass_type == "Walk-up" ').duration_min.mode()[0]
oneday_mean = math.ceil(duration_lim.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim.query(' pass_type == "One Day" ').duration_min.mode()[0]
monthly_mean = math.ceil(duration_lim.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim.query(' pass_type == "Monthly" ').duration_min.mode()[0]
annual_mean = math.ceil(duration_lim.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim.query(' pass_type == "Annual" ').duration_min.mode()[0]
print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean : ', walkup_mean, 'minutes')
print('oneday_mean : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode : ', walkup_mode, 'minutes')
print('oneday_mode : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode : ', annual_mode, 'minutes')
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')
walkup_mean = math.ceil(duration_lim_120.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim_120.query(' pass_type == "Walk-up" ').duration_min.mode()[0]
oneday_mean = math.ceil(duration_lim_120.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim_120.query(' pass_type == "One Day" ').duration_min.mode()[0]
monthly_mean = math.ceil(duration_lim_120.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim_120.query(' pass_type == "Monthly" ').duration_min.mode()[0]
annual_mean = math.ceil(duration_lim_120.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim_120.query(' pass_type == "Annual" ').duration_min.mode()[0]
print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean : ', walkup_mean, 'minutes')
print('oneday_mean : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode : ', walkup_mode, 'minutes')
print('oneday_mode : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode : ', annual_mode, 'minutes')
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')
walkup_mean = math.ceil(duration_lim_30.query(' pass_type == "Walk-up" ').duration_min.mean())
walkup_mode = duration_lim_30.query(' pass_type == "Walk-up" ').duration_min.mode()[0]
oneday_mean = math.ceil(duration_lim_30.query(' pass_type == "One Day" ').duration_min.mean())
oneday_mode = duration_lim_30.query(' pass_type == "One Day" ').duration_min.mode()[0]
monthly_mean = math.ceil(duration_lim_30.query(' pass_type == "Monthly" ').duration_min.mean())
monthly_mode = duration_lim_30.query(' pass_type == "Monthly" ').duration_min.mode()[0]
annual_mean = math.ceil(duration_lim_30.query(' pass_type == "Annual" ').duration_min.mean())
annual_mode = duration_lim_30.query(' pass_type == "Annual" ').duration_min.mode()[0]
print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('walkup_mean : ', walkup_mean, 'minutes')
print('oneday_mean : ', oneday_mean, 'minutes')
print('monthly_mean : ', monthly_mean, 'minutes')
print('annual_mean : ', annual_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('walkup_mode : ', walkup_mode, 'minutes')
print('oneday_mode : ', oneday_mode, 'minutes')
print('monthly_mode : ', monthly_mode, 'minutes')
print('annual_mode : ', annual_mode, 'minutes')
Convert the most frequent and average trip durations categorized by pass type into a dataframe:
# convert the most frequent and average trip durations categorized by pass type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30', '< 30', '< 30',
'< 120', '< 120', '< 120', '< 120',
'overall', 'overall', 'overall', 'overall']
duration_df['pass_type'] = ['Walk-up', 'One Day', 'Monthly', 'Annual',
'Walk-up', 'One Day', 'Monthly', 'Annual',
'Walk-up', 'One Day', 'Monthly', 'Annual']
duration_df['duration_avg'] = [17, 16, 11, 10,
32, 31, 12, 13,
52, 61, 15, 25]
duration_df['duration_mode'] = [10, 8, 5, 5,
10, 8, 5, 5,
10, 8, 5, 5]
duration_df
Plot the most frequent and average trip durations by pass type:
plt.figure(figsize = [12, 5])
flatui = ["#26bda7", "#9b59b6", "#3498db", "#34495e"]
sb.set_palette(flatui, n_colors=4, desat=0.8)
# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'pass_type', alpha = 0.8)
# improve plot aesthetics
plt.title('Avg. Trip durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+10, 10)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
g = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'pass_type', alpha = 0.8)
# improve plot aesthetics
plt.title('Most frequent durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on pass type over datasets\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.5 Assessment of trip durations based on pass type over datasets.png', dpi=300, bbox_inches='tight')
Observations:
The above plot shows that bike rides on short-term passes such as One Day and Walk-up have higher average trip durations than rides on long-term subscriptions such as Monthly and Annual passes. The difference increases as the trip duration limit is relaxed.
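The widening gap can be checked with a little arithmetic on the averages in the dataframe above. The sketch below pivots those `duration_avg` values and compares the short-term passes with the long-term ones. The grouping into `short_term` and `long_term` and the frame name `pass_avg_df` are assumptions made for illustration, and each group figure is an unweighted mean of the two per-pass averages, not a trip-weighted mean.

```python
import pandas as pd

# duration_avg values copied from the pass type duration_df above
pass_avg_df = pd.DataFrame({
    "dataset": ["< 30"] * 4 + ["< 120"] * 4 + ["overall"] * 4,
    "pass_type": ["Walk-up", "One Day", "Monthly", "Annual"] * 3,
    "duration_avg": [17, 16, 11, 10, 32, 31, 12, 13, 52, 61, 15, 25],
})

# one row per dataset, one column per pass type
wide = pass_avg_df.pivot(index="dataset", columns="pass_type", values="duration_avg")
wide = wide.loc[["< 30", "< 120", "overall"]]  # restore the narrative order

short_term = wide[["Walk-up", "One Day"]].mean(axis=1)
long_term = wide[["Monthly", "Annual"]].mean(axis=1)
gap = short_term - long_term
print(gap)  # 6.0, 19.0 and 36.5 minutes for '< 30', '< 120' and 'overall'
```

The gap grows from 6 minutes to 36.5 minutes as the limit is relaxed, which is consistent with the observation above.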
Calculate the most frequent and average trip durations by fare type:
Calculate the average trip duration and the most frequent trip duration for each fare type:
# Limit the dataset that has entries greater than 1 minute duration
duration_lim = bikeshare.query(' duration_min > 1 ')
base_mean = math.ceil(duration_lim.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim.query(' fare_type == "Base" ').duration_min.mode()[0]
extended_mean = math.ceil(duration_lim.query(' fare_type == "Extended" ').duration_min.mean())
extended_mode = duration_lim.query(' fare_type == "Extended" ').duration_min.mode()[0]
print('Overall Dataset excluding 1 min')
print('='*31 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean : ', base_mean, 'minutes')
print('extended_mean : ', extended_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode : ', base_mode, 'minutes')
print('extended_mode : ', extended_mode, 'minutes')
# Limit the dataset that has entries under 120 minutes duration excluding 1 min
duration_lim_120 = bikeshare.query(' duration_min > 1 and duration_min <= 120 ')
base_mean = math.ceil(duration_lim_120.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim_120.query(' fare_type == "Base" ').duration_min.mode()[0]
extended_mean = math.ceil(duration_lim_120.query(' fare_type == "Extended" ').duration_min.mean())
extended_mode = duration_lim_120.query(' fare_type == "Extended" ').duration_min.mode()[0]
print('Dataset limited under 120 min excluding 1 min')
print('='*45 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean : ', base_mean, 'minutes')
print('extended_mean : ', extended_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode : ', base_mode, 'minutes')
print('extended_mode : ', extended_mode, 'minutes')
# Limit the dataset that has entries under 30 minutes duration excluding 1 min
duration_lim_30 = bikeshare.query(' duration_min > 1 and duration_min <= 30 ')
base_mean = math.ceil(duration_lim_30.query(' fare_type == "Base" ').duration_min.mean())
base_mode = duration_lim_30.query(' fare_type == "Base" ').duration_min.mode()[0]
# extended fare statistics are not calculated as they do not exist under 30 minutes
print('Dataset limited under 30 min excluding 1 min')
print('='*44 + '\n')
print('Duration mean'.center(28,'-'))
print('base_mean : ', base_mean, 'minutes')
print('\n')
print('Duration mode'.center(28,'-'))
print('base_mode : ', base_mode, 'minutes')
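The Extended fare statistics are skipped by hand above because that subset is empty under 30 minutes. On an empty subset both `math.ceil(...mean())` and `.mode()[0]` raise an error. A minimal sketch of helpers that return NaN instead is shown below. The names `ceil_mean` and `first_mode` and the made-up `trips` frame are hypothetical and not part of the original analysis.

```python
import math
import pandas as pd

def ceil_mean(values):
    """Mean rounded up to a whole minute, or NaN when the subset is empty."""
    return math.nan if values.empty else math.ceil(values.mean())

def first_mode(values):
    """First mode, or NaN when the subset is empty."""
    modes = values.mode()
    return math.nan if modes.empty else modes.iloc[0]

# Hypothetical subset: no "Extended" fare rows survive a 30-minute cap
trips = pd.DataFrame({
    "fare_type": ["Base", "Base", "Base"],
    "duration_min": [6, 6, 14],
})
base = trips.loc[trips["fare_type"] == "Base", "duration_min"]
extended = trips.loc[trips["fare_type"] == "Extended", "duration_min"]

print(ceil_mean(base), first_mode(base))          # 9 6
print(ceil_mean(extended), first_mode(extended))  # nan nan
```

Returning NaN lines up with the `np.nan` placeholders used for the Extended fare in the dataframe that follows, so the same code path could serve every fare type and limit.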
Convert the most frequent and average trip durations categorized by fare type into a dataframe:
# convert the most frequent and average trip durations categorized by fare type into a dataframe:
duration_df = pd.DataFrame()
duration_df['dataset'] = ['< 30', '< 30',
'< 120', '< 120',
'overall', 'overall']
duration_df['fare_type'] = ['Base', 'Extended',
'Base', 'Extended',
'Base', 'Extended']
duration_df['duration_avg'] = [12, np.nan,
12, 57,
12, 119]
duration_df['duration_mode'] = [6, np.nan,
6, 31,
6, 31]
duration_df
Plot the most frequent and average trip durations by fare type:
plt.figure(figsize = [12, 5])
flatui = ["#e278fa", "#787efa"]
sb.set_palette(flatui, n_colors=2, desat=0.8)
# left plot: point plot - Avg trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_avg', hue = 'fare_type', alpha = 0.8)
# improve plot aesthetics
plt.title('Avg. Trip durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
locs, labels = plt.yticks()
y_ticks_new = np.arange(0, int(math.ceil(max(locs)))+25, 25)
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
sb.despine(top=True, bottom=False, left=False, right=True);
# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
avg_rental_counts = duration_df["duration_avg"]
avg_rental_types = duration_df["fare_type"]
avg_rental_max = avg_rental_counts.max()
clrs = ['mediumpurple' if (trip == "Extended") else 'violet' for trip in avg_rental_types ]
# get the current tick locations and labels
# locs, labels = plt.xticks()
# loop through each pair of locations and labels
for loc, avg_rental_count, clr in zip(locs, avg_rental_counts, clrs):
    # skip missing values: Extended fares do not exist under 30 minutes
    if pd.notna(avg_rental_count):
        label = '{:0.0f} min'.format(math.ceil(avg_rental_count))
        # place the annotation slightly above and to the left of the point
        plt.text(loc-0.2, avg_rental_count + int(avg_rental_max/20), label, ha = 'center', color = 'black', fontsize = 12,
                 bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# =====================================================
# /////////////////////////////////////////////////////
# right plot: point plot - most frequent trip duration
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
ax2 = sb.pointplot(data = duration_df, x = 'dataset', y = 'duration_mode', hue = 'fare_type', alpha = 0.8)
# improve plot aesthetics
plt.title('Most frequent durations\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Duration (minutes)\n', fontsize = 14)
plt.xlabel('\nTrip durations (minutes)', fontsize = 14)
plt.xticks(fontsize = 12)
# set y-axis limits to be same as left plot and assign same yticks
plt.ylim(ax1.get_ylim())
plt.yticks(y_ticks_new, y_ticks_new, fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
sb.despine(top=True, bottom=False, left=False, right=True);
# add annotations
# -------------------------------------------------------
locs = [0, 0, 1, 1, 2, 2]
freq_rental_counts = duration_df["duration_mode"]
freq_rental_types = duration_df["fare_type"]
freq_rental_max = freq_rental_counts.max()
clrs = ['mediumpurple' if (trip == "Extended") else 'violet' for trip in freq_rental_types ]
# loop through each pair of locations and labels
for loc, freq_rental_count, clr in zip(locs, freq_rental_counts, clrs):
    # skip missing values: Extended fares do not exist under 30 minutes
    if pd.notna(freq_rental_count):
        label = '{:0.0f} min'.format(math.ceil(freq_rental_count))
        # place the annotation above the point
        plt.text(loc, freq_rental_count + int(freq_rental_max/5), label, ha = 'center', color = 'black', fontsize = 12,
                 bbox={'pad':1.9,'alpha':0.2,'color':'none','fc':clr})
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.75)
plt.suptitle('Assessment of trip durations based on fare type over datasets\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.15.6 Assessment of trip durations based on fare type over datasets.png', dpi=300, bbox_inches='tight')
Observations:
- The above plots depict that bike trips subjected to Extended fares have a heavy presence of outliers. When the outliers are removed and the dataset is limited to 120 minutes, the average trip duration subjected to Extended fares is 57 minutes.
- The average trip duration subjected to Base fares is 12 minutes, while the most frequent trip duration is 6 minutes.
- The most frequent trip duration subjected to Extended fares is 31 minutes. This suggests that many customers who intended to return the bike for the Base fare missed the 30-minute limit by a margin of 1 minute.
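The observation about the 31-minute mode can be checked by measuring how many Extended-fare trips end just after the Base fare window closes. The sketch below is one way to do that on toy data. The function name `near_miss_share` and the 5-minute `margin` are hypothetical choices; the 30-minute limit and the `duration_min` and `fare_type` columns come from the notebook.

```python
import pandas as pd

def near_miss_share(df, base_limit=30, margin=5):
    """Share of Extended-fare trips that end within `margin` minutes after `base_limit`."""
    extended = df[df['fare_type'] == 'Extended']
    if extended.empty:
        return 0.0
    durations = extended['duration_min']
    near_miss = (durations > base_limit) & (durations <= base_limit + margin)
    # the mean of a boolean Series is the fraction of True values
    return near_miss.mean()

# toy data standing in for the real `bikeshare` dataframe
toy = pd.DataFrame({
    'duration_min': [6, 12, 31, 31, 33, 60, 120],
    'fare_type': ['Base', 'Base', 'Extended', 'Extended',
                  'Extended', 'Extended', 'Extended'],
})
share = near_miss_share(toy)
print('{:.0%} of Extended-fare trips end within 5 minutes of the limit'.format(share))
```

A large share on the real data would support the reading that these customers were aiming for the Base fare rather than planning a long ride.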
- Normal trip durations or small trip durations and avoids trips with Long and Very Long durations.
- Overall dataset is 30 minutes. However, when the outliers are removed and re-evaluated, the average trip duration is down to 18 minutes.
- 12 minutes.
- 6 minutes for all datasets limited by trip durations. This denotes that most customers rent the bikes for a quick, short trip.
- One Way trips compared to Round Trips.
- One Way trips as 28 minutes. Meaning, most of the customers who prefer longer trips are using the full extent of the Base Fare trip duration when it comes to One Way trips.
- Round Trips as 5 minutes. Meaning, most of the customers prefer short trips when it comes to Round Trips.
- Smart bikes over other bike types when it comes to trips with longer durations.
- Smart bike and other bikes increases with the dataset limitation. For example, Smart bikes have a higher average trip duration under 120 minutes compared to 30 minutes.
- Standard and Electric bikes have close values of average trip duration. This denotes that customers have almost the same preference for these bike types when it comes to trip durations.
- One Day and Walk-up passes have higher average trip durations compared to long-term subscriptions such as the Monthly and Annual passes. The difference increases with the trip duration limit.
- Bike trips subjected to Extended fares have a heavy presence of outliers. When the outliers are removed and the dataset is limited to 120 minutes, the average trip duration subjected to Extended fares is 57 minutes.
- The most frequent trip duration subjected to Extended fares is 31 minutes. This suggests that many customers who intended to return the bike for the Base fare missed the 30-minute limit by a margin of 1 minute.
- Normal trip durations or small trip durations and avoids trips with Long and Very Long durations. Even if this behaviour is to be expected, encouraging customers to ride the bikes for longer durations will generate income and improve business metrics.
- 2hr bike rallies and other events will attract enthusiasts to ride the bike for longer durations.
- Low Fares for tourists will attract them to rent the bike for longer durations.
Column: distance_miles
Data type: numerical, continuous
Plot: Bar chart, Pie plot
Categorical distribution of trip distances:¶
# compute the descriptive statistics of trip distances
bikeshare.distance_miles.describe()
As the distance (displacement in this scenario) depends on the start_station and end_station coordinates, Round Trip entries have a distance_miles value of 0, since what we calculated is essentially the displacement. Hence, categorize the trips with 0 miles as Round Trips.
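This section does not show how `distance_miles` was derived from the station coordinates. One common way to compute a straight-line displacement from latitude and longitude is the haversine formula. The sketch below assumes that approach purely for illustration; the function name, the Earth radius constant and the sample coordinates are hypothetical and are not taken from the notebook. It shows why a trip that starts and ends at the same station yields 0 miles.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation

def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

# a round trip starts and ends at the same station, so the displacement is 0
round_trip = haversine_miles(34.05, -118.25, 34.05, -118.25)

# a one-way trip between two hypothetical stations 0.01 degrees of latitude apart
one_way = haversine_miles(34.05, -118.25, 34.06, -118.25)

print('round trip: {:.2f} miles, one way: {:.2f} miles'.format(round_trip, one_way))
```

Because displacement ignores the route actually ridden, it is a lower bound on the true distance travelled, which is why round trips cannot be analysed by distance here.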
Breakdown the trip distances into categories and convert into a dataframe:
distances = {'trip_type' : pd.Series(['Round Trip', 'Very Small', 'Small', 'Normal', 'Long', 'Very Long']),
'trip_count' : pd.Series([bikeshare.query(' distance_miles == 0 ').shape[0],
bikeshare.query(' distance_miles > 0 and distance_miles < 0.1 ').shape[0],
bikeshare.query(' distance_miles >= 0.1 and distance_miles < 0.5 ').shape[0],
bikeshare.query(' distance_miles >= 0.5 and distance_miles < 1 ').shape[0],
bikeshare.query(' distance_miles >= 1 and distance_miles < 10 ').shape[0],
bikeshare.query(' distance_miles >= 10 ').shape[0]])}
# create Dataframe.
trip_distances = pd.DataFrame(distances)
trip_distances
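The six `query` calls above each scan the full dataframe. As an alternative sketch, the same categories can be assigned in a single pass with `numpy.select`, which picks the first matching condition for each row. The toy distances are hypothetical, and the sketch assumes `distance_miles` is never negative.

```python
import numpy as np
import pandas as pd

# toy distances standing in for bikeshare.distance_miles
d = pd.Series([0, 0, 0.05, 0.3, 0.4, 0.7, 2.5, 12])

# conditions are checked in order, so each upper bound implies the previous lower bound
conditions = [d == 0, d < 0.1, d < 0.5, d < 1, d < 10]
labels = ['Round Trip', 'Very Small', 'Small', 'Normal', 'Long']
category = np.select(conditions, labels, default='Very Long')

order = ['Round Trip', 'Very Small', 'Small', 'Normal', 'Long', 'Very Long']
counts = pd.Series(category).value_counts().reindex(order, fill_value=0)
trip_distances_toy = counts.rename_axis('trip_type').reset_index(name='trip_count')
print(trip_distances_toy)
```

The resulting frame has the same `trip_type` and `trip_count` columns as `trip_distances`, so it could feed the bar chart below without further changes.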
Plot the categorical distribution of trip distances:
Bar chart:
# Assign grid and color palette as per requirement
plt.figure(figsize = [32, 8])
sb.set_style("white")
# plot pre-calculations
base_color = sb.color_palette()[0]
dist_order = ['Very Long', 'Long', 'Normal', 'Small', 'Very Small', 'Round Trip']
time_order = ['[10, )', '[1, 10)', '[0.5, 1)', '[0.1, 0.5)', '(0, 0.1)', '[0]']
trip_counts = trip_distances.trip_count
trip_order = trip_distances.trip_type
x_tick_values = np.arange(0, trip_counts.max() + 50000, 50000)
x_tick_names = ['{:0.0f} K'.format(v/1000) for v in x_tick_values]
y_tick_values = [0, 1, 2, 3, 4, 5]
y_tick_names = dist_order
clrs = ['indianred', '#585370', '#585370', '#585370', 'indianred', '#674c78']
# bar plot
sb.barplot(x = trip_counts, y = trip_order, order = dist_order, palette=clrs, alpha= 1, saturation = 1)
# plot - visual enhancements
plt.title('Categorical distribution of Trip distances\n', weight = 'bold', fontsize = 30)
plt.xticks(x_tick_values, x_tick_names, fontsize = 22)
plt.yticks(y_tick_values, y_tick_names, fontsize = 22)
plt.xlabel('\nNumber of trips (thousands)', fontsize = 26)
plt.ylabel('Distance type\n', fontsize = 26)
# Create a legend:
# -------------------------------------------------------
indents = [10, 13, 12, 14, 11, 11]
# Plot empty lists with the desired label
for dist, time, indent in zip(dist_order, time_order, indents):
    plt.scatter([], [], c='k', alpha=0.3,
                label= '{}'.format(dist).ljust(indent, ' ') + ' - ' + '{}'.format(time))
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=True, ncol = 1, framealpha = 1,
borderpad=1, borderaxespad=1, bbox_to_anchor = (1, 0.5), loc = 6, labelspacing=0.5,
title='Distance - miles', title_fontsize=24, fontsize=22, facecolor='white',
markerfirst=True, handlelength=0.5, handletextpad=0.5)
# -------------------------------------------------------
sb.despine();
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.1 Categorical distribution of Trip distances.png', dpi=300, bbox_inches='tight')
Observation:
- The above plot depicts a fair distribution of bike rentals across the Small, Normal and Long trip distances. This represents a productive business model covering all kinds of customer needs.
- Only a small number of customers take Very Small and Very Long bike rides. This is also a good sign, as Very Small rides represent customer dissatisfaction, and a larger number of Very Long rides would disperse the available bikes across a geographical location.
- A good number of customers preferred Round Trips.
Reform:
- Organizing long bike rallies and other events will attract enthusiasts to ride the bike for longer distances.
Deeper exploration of the factors that influence the trip distances for hidden insights:
Limit the dataset to 3 miles to minimize the influence of outliers:
# calculate the percentage of the dataset that falls under `3 miles` trip distance.
data_percent = np.round((bikeshare.query(' distance_miles <= 3 ').shape[0]/bikeshare.shape[0])*100, 2)
print('percent of trips that fall under 3 miles trip distances is : {} %'.format(data_percent))
Trips under 3 miles constitute 99% of the distance distribution. Round Trip entries have a distance/displacement equal to zero and are clustered together, unlike One Way trips, which are distributed between 1-25 miles. Hence, remove the entries with a distance value of 0 in the further analysis to obtain the correct stats.
Calculate the most frequent and average trip distances by trip type:¶
Limit the dataset to the entries under 3 miles trip distance. As round trips are involved, entries with 0 distances are included.
# Limit the dataset to the entries under 3 miles distance
distance_lim_3 = bikeshare.query(' distance_miles <= 3 ')
Calculate the average trip distance and the most frequent trip distance subjected to each trip type.
plt.figure(figsize = [12, 5])
sb.set_palette(palette = "GnBu_d", n_colors = 2, desat = None)
base_color = sb.color_palette()[0]
# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.barplot(data = distance_lim_3, x = "trip_type", y = "distance_miles", hue = 'trip_type')
# improve plot aesthetics
plt.title('Avg. Trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nTrip type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
# prepare the data for the plot
oneway_mode = distance_lim_3.query(' trip_type == "One Way" ').distance_miles.mode()[0]
roundtrip_mode = distance_lim_3.query(' trip_type == "Round Trip" ').distance_miles.mode()[0]
heights = [oneway_mode, roundtrip_mode]
labels = distance_lim_3.trip_type.sort_values(ascending=True).unique()
g = sb.barplot(x = labels, y = heights, hue = labels)
# improve plot aesthetics
plt.title('Most frequent trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nTrip type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.ylim(ax1.get_ylim())
plt.yticks(fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,
title='Trip type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on trip type\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.2 Assessment of trip distances under 3 miles based on trip type.png', dpi=300, bbox_inches='tight')
Observation:
- One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles.
- As Round Trips have a total displacement of 0, no statistical analysis can be performed on them.
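The plotting cells in this section compute the mode with one `query` per category. A compact alternative, sketched below on toy data, is to group once and aggregate both statistics together. The helper name `mean_and_mode` is hypothetical; the column names follow the notebook.

```python
import pandas as pd

def mean_and_mode(df, group_col, value_col):
    """Return the mean and the (first) mode of `value_col` for each group."""
    grouped = df.groupby(group_col)[value_col]
    return pd.DataFrame({
        'mean': grouped.mean(),
        # Series.mode() can return several values; keep the smallest one
        'mode': grouped.agg(lambda s: s.mode().iloc[0]),
    })

# toy data standing in for `distance_lim_3`
toy = pd.DataFrame({
    'trip_type': ['One Way'] * 4 + ['Round Trip'] * 2,
    'distance_miles': [0.5, 0.5, 1.0, 1.2, 0.0, 0.0],
})
stats = mean_and_mode(toy, 'trip_type', 'distance_miles')
print(stats)
```

The same helper would apply to `bike_type`, `pass_type` and `fare_type`, which avoids repeating the per-category queries in each of the following cells.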
Calculate the most frequent and average trip distances by bike type:¶
Note: Since any conclusion/insight drawn on the unknown bike type is not helpful, the unknown bike type is excluded from the analysis.
Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3 and distance_miles > 0 and bike_type != "unknown" ').copy()
# categorize the bike type variable
level_order = ['Standard', 'Electric', 'Smart']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
distance_lim_3['bike_type'] = distance_lim_3['bike_type'].astype(ordered_cat)
Calculate the average trip distance and the most frequent trip distance subjected to each bike type.
# Assign color palette and figure size as per requirement
plt.figure(figsize = [12, 5])
sb.set_style('white')
flatui = ['#60acfc', '#91ffda', '#bd91ff']
sb.set_palette(flatui, n_colors=3, desat=0.8)
# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.barplot(data = distance_lim_3, x = "bike_type", y = "distance_miles", hue = 'bike_type', dodge=False)
# improve plot aesthetics
plt.title('Avg. Trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nBike type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
# prepare the data for the plot
standard_mode = distance_lim_3.query(' bike_type == "Standard" ').distance_miles.mode()[0]
electric_mode = distance_lim_3.query(' bike_type == "Electric" ').distance_miles.mode()[0]
smart_mode = distance_lim_3.query(' bike_type == "Smart" ').distance_miles.mode()[0]
heights = [standard_mode, electric_mode, smart_mode]
labels = distance_lim_3.bike_type.sort_values(ascending=True).unique()
g = sb.barplot(x = labels, y = heights, hue = labels, dodge=False)
# improve plot aesthetics
plt.title('Most frequent trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nBike type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.ylim(ax1.get_ylim())
plt.yticks(fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Bike type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in g.patches:
    g.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
               ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on bike type\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.3 Assessment of trip distances under 3 miles based on bike type.png', dpi=300, bbox_inches='tight')
Observation:
- The above plot depicts that customers preferred Smart bikes for longer trip distances compared to other bike types.
- The Standard and Electric bike types have the same average and most frequent trip distances. This conveys that customers have a similar preference for Standard and Electric bikes.
Calculate the most frequent and average trip distances by pass type:¶
Note: Since the Flex pass was introduced to employees for testing purposes, the Flex pass type is excluded from the analysis.
Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3 and distance_miles > 0 and pass_type != "Flex" ').copy()
# categorize the pass type variable
level_order = ['Walk-up', 'One Day', 'Monthly', 'Annual']
ordered_cat = pd.api.types.CategoricalDtype(ordered = True, categories = level_order)
distance_lim_3['pass_type'] = distance_lim_3['pass_type'].astype(ordered_cat)
Calculate the average trip distance and the most frequent trip distance subjected to each pass type.
# Assign color palette and figure size as per requirement
plt.figure(figsize = [12, 5])
sb.set_style('white')
flatui = ["#26bda7", "#9b59b6", "#3498db", "#34495e"]
sb.set_palette(flatui, n_colors=4, desat=0.8)
# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.barplot(data = distance_lim_3, x = "pass_type", y = "distance_miles", hue = 'pass_type', dodge=False)
# improve plot aesthetics
plt.title('Avg. Trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nPass type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
# prepare the data for the plot
walkup_mode = distance_lim_3.query(' pass_type == "Walk-up" ').distance_miles.mode()[0]
oneday_mode = distance_lim_3.query(' pass_type == "One Day" ').distance_miles.mode()[0]
monthly_mode = distance_lim_3.query(' pass_type == "Monthly" ').distance_miles.mode()[0]
annual_mode = distance_lim_3.query(' pass_type == "Annual" ').distance_miles.mode()[0]
heights = [walkup_mode, oneday_mode, monthly_mode, annual_mode]
labels = distance_lim_3.pass_type.sort_values(ascending=True).unique()
ax2 = sb.barplot(x = labels, y = heights, hue = labels, dodge=False)
# improve plot aesthetics
plt.title('Most frequent trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nPass type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper right', labelspacing=0.5, ncol = 1,
title='Pass type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in ax2.patches:
    ax2.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# adjust the two plots to have the same ylimits/yticks
# compare the upper limits explicitly instead of comparing the (bottom, top) tuples
if ax1.get_ylim()[1] < ax2.get_ylim()[1]:
    ax1.set_ylim(ax2.get_ylim())
else:
    ax2.set_ylim(ax1.get_ylim())
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on pass type\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.4 Assessment of trip distances under 3 miles based on pass type.png', dpi=300, bbox_inches='tight')
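The manual comparison of y-axis limits in the cell above can also be delegated to matplotlib. A minimal sketch, assuming the two panels are created together with `plt.subplots`, is shown below; the bar heights are hypothetical placeholders, not values from the dataset.

```python
import matplotlib.pyplot as plt

# hypothetical bar heights standing in for the computed means and modes
labels = ['Walk-up', 'One Day', 'Monthly', 'Annual']
avg_heights = [0.9, 0.9, 0.7, 0.8]
mode_heights = [0.6, 0.6, 0.4, 0.5]

# sharey=True links the y-axis of both panels, so no manual comparison is needed
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(12, 5), sharey=True)

ax_left.bar(labels, avg_heights)
ax_left.set_title('Avg. Trip distance')
ax_left.set_ylabel('Avg. distance (miles)')

ax_right.bar(labels, mode_heights)
ax_right.set_title('Most frequent trip distance')

fig.suptitle('Assessment of trip distances under 3 miles based on pass type')
```

Note that `sharey=True` hides the tick labels on the right panel by default, which differs slightly from the layout produced by the original cell.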
Observation:
- The above plot depicts that rides subjected to the Monthly pass have a lower average trip distance compared to other pass types.
- The short-term passes, One Day and Walk-up, have a higher most frequent trip distance than the long-term subscriptions. This is likely because long-term passes are chosen by working individuals for a ride to work, while short-term passes are preferred by tourists and explorers who take longer rides to explore the city.
Calculate the most frequent and average trip distances by fare type:¶
Limit the dataset to the entries under 3 miles and remove entries with 0 trip distances:
# Limit the dataset to the entries under 3 miles distance and remove entries with '0' distances
distance_lim_3 = bikeshare.query(' distance_miles <= 3 and distance_miles > 0 ').copy()
Calculate the average trip distance and the most frequent trip distance subjected to each fare type.
# Assign color palette and figure size as per requirement
plt.figure(figsize = [12, 5])
flatui = ["#fff480"]
sb.set_palette(flatui, n_colors=1, desat=0.8)
base_color = sb.color_palette()[0]
# left plot: bar chart - Avg trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 1)
ax1 = sb.barplot(data = distance_lim_3, x = "fare_type", y = "distance_miles",
hue = 'fare_type', alpha = 0.8, dodge=False)
# improve plot aesthetics
plt.title('Avg. Trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.ylabel('Avg. distance (miles)\n', fontsize = 14)
plt.xlabel('\nFare type', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot empty legend
plt.legend('', frameon=False, fancybox=False)
# add annotations
# -------------------------------------------------------
for p in ax1.patches:
    ax1.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# right plot: bar chart - most frequent trip distance
# =====================================================
# /////////////////////////////////////////////////////
sb.set_style('white')
plt.subplot(1, 2, 2)
# prepare the data for the plot
base_mode = distance_lim_3.query(' fare_type == "Base" ').distance_miles.mode()[0]
extended_mode = distance_lim_3.query(' fare_type == "Extended" ').distance_miles.mode()[0]
heights = [base_mode, extended_mode]
labels = distance_lim_3.fare_type.sort_values(ascending=True).unique()
ax2 = sb.barplot(x = labels, y = heights, hue = labels, alpha = 0.8, dodge=False)
# improve plot aesthetics
plt.title('Most frequent trip distance\n\n', weight = 'bold', fontsize = 16, color = 'dimgrey')
plt.xlabel('\nFare type', fontsize = 14)
plt.ylabel('Distance (miles)\n', fontsize = 14)
plt.xticks(fontsize = 12)
plt.yticks(fontsize=12)
# plot legend
plt.legend(scatterpoints=1, frameon=True, fancybox=True, shadow=False, framealpha = 1,
borderpad=1, borderaxespad=1, loc = 'upper left', labelspacing=0.5, ncol = 1,
title='Fare type', title_fontsize=12, fontsize=10, facecolor='white',
markerfirst=True, handlelength=2, handletextpad=0.5)
# add annotations
# -------------------------------------------------------
for p in ax2.patches:
    ax2.annotate(format(p.get_height(), '.1f'), (p.get_x() + p.get_width() / 2., p.get_height()),
                 ha = 'center', va = 'center', xytext = (0, 10), textcoords = 'offset points', fontsize = 12)
# -------------------------------------------------------
sb.despine(top=True, bottom=False, left=False, right=True);
# =====================================================
# /////////////////////////////////////////////////////
# adjust the two plots to have the same y axis limits
# compare the upper limits explicitly instead of comparing the (bottom, top) tuples
if ax1.get_ylim()[1] < ax2.get_ylim()[1]:
    ax1.set_ylim(ax2.get_ylim())
else:
    ax2.set_ylim(ax1.get_ylim())
plt.subplots_adjust(wspace=0.3, hspace=0.3)
plt.subplots_adjust(top=0.7)
plt.suptitle('Assessment of trip distances under 3 miles based on fare type\n', fontsize = 16, weight = 'bold');
# savefig by passing (bbox_inches='tight'),which will adjust the figure to include all of the x and y labels
plt.savefig('explanatory plots/3.16.5 Assessment of trip distances under 3 miles based on fare type.png', dpi=300, bbox_inches='tight')
Observation:
- The above plot depicts that trips with Extended fares have a higher most frequent trip distance compared to Base fares. This is to be expected, as trip distances and trip durations are correlated.
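The claim that trip distances and trip durations are correlated is not quantified in this section. A minimal sketch of how it could be checked is given below, using toy values rather than the real dataset; the column names follow the notebook. Pearson's coefficient measures a linear relationship, while the rank-based variant (equivalent to Spearman's coefficient) only assumes a monotonic one.

```python
import pandas as pd

# toy data standing in for `distance_lim_3`
toy = pd.DataFrame({
    'duration_min': [5, 10, 15, 20, 40, 60],
    'distance_miles': [0.3, 0.6, 0.8, 1.1, 1.9, 2.8],
})

# Pearson correlation between duration and distance
pearson = toy['duration_min'].corr(toy['distance_miles'])

# Spearman correlation computed as the Pearson correlation of the ranks
spearman = toy['duration_min'].rank().corr(toy['distance_miles'].rank())

print('pearson: {:.3f}, spearman: {:.3f}'.format(pearson, spearman))
```

On the real data the coefficient is likely to be weaker than in this toy example, since round trips and leisurely rides have long durations but little or no displacement.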
- There is a fair distribution of bike rentals across the Small, Normal and Long trip distances. This represents a productive business model covering all kinds of customer needs.
- Only a small number of customers take Very Small and Very Long bike rides. This is also a good sign, as Very Small rides represent customer dissatisfaction, and a larger number of Very Long rides would disperse the available bikes across a geographical location.
- A good number of customers preferred Round Trips.
- One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles.
- Customers preferred Smart bikes for longer trip distances compared to other bike types.
- Rides subjected to the Monthly pass have a lower average trip distance compared to other pass types.
- The short-term passes, One Day and Walk-up, have a higher most frequent trip distance than the long-term subscriptions. This is likely because long-term passes are chosen by working individuals for a ride to work, while short-term passes are preferred by tourists and explorers who take longer rides to explore the city.
- Trips with Extended fares have a higher most frequent trip distance compared to Base fares. This is to be expected, as trip distances and trip durations are correlated.
Investigation Summary:¶=====================================
Insights: |
|
|---|---|
Insight 1 |
The aggregated distribution of bike rentals over all years, suggest that the customers prefer One Way trips compared to Round Trip's for bike rental with a grey area in the Early hours of the day, where the average number of bike rentals are very low and statistically not significant for comparision. |
Insight 2 |
The average number of bike rentals subjected to One Way trips decreases during Saturaday's and Sunday's, while Round Trips experiece a slight increase. |
Insight 3 |
The first half of the year 2019 experiences a relatively low number of bike rentals subjected to One Way trips, compared to later half of the year. This trend is not limited to year 2019 but consistent over the other years. |
Insight 4 |
The Standard bike rentals subjected to One Way trips has reduced significantly during 2019 compared to previous year 2018. At the same there is a noticable increase in Electric bike rentals subjected to One Way trips. This depicts that the customers preffered Electric bikes over Standard bikes for One Way trips in the year 2019. However, as the Electric bike was introduced during the end of the year 2018, the customers that used Standard bikes suddenly shifted towrds Electric bikes after the second quarter of the year 2019. This states that the customer's preference over the bike has changed but the total number of bike rentals subjected to One Way trips has not decreased. |
Insight 5 |
Customers who pay the Base fare prefer One Way trips, while customers who pay Extended fares take almost as many Round Trips as One Way trips and do not exhibit a preference for either trip type. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Care should be taken to increase the number of bike rentals at the end of the week. Organizing events such as bike rallies will significantly increase bike rentals during holidays and weekends. |
Reform 2 |
Announcing discounts on One Way trips from stations with a high bike count to stations with a low bike count during the weekdays will normalize the distribution of bikes over all stations. |
Reform 3 |
Promotions/discounts should be offered on One Way trips during the first half of the year to encourage customers to take more One Way trips. |
Reform 4 |
Having fewer customers who pay Extended fares on One Way trips has a positive end result and does not require any intervention. If more customers took longer One Way rides (fares and ride durations are correlated), the bikes would end up farther from their home stations, which might create a gap between the supply and demand of bikes at these stations. However, as more customers prefer One Way trips for shorter rides (Base fare), the bikes end up in the same geographical cluster, which makes it easier to redirect customers to nearby available stations in case of a bike shortage. |
Insights: |
|
|---|---|
Insight 1 |
The classification of bikes was introduced at the end of 2018. Hence the rentals with an unknown bike category are ignored and the analysis is limited to the year 2019. |
Insight 2 |
Rentals of the Standard bike type decrease over the year 2019, while rentals of the Smart and Electric bike types increase within the same period. Even though Standard bikes were popular at the start of 2019, customers preferred Smart and Electric bikes towards the end of the year. Hence it can be concluded that the launch of the Electric and Smart bikes was a success. |
Insight 3 |
Even though Standard bikes are the most popular choice during the first quarter of the year, the Electric bikes gradually gained popularity among One Way trips over the rest of the year. |
Insight 4 |
Customers who take Round Trips do not show a preference for any bike type. |
Insight 5 |
Even though Standard bikes are the most popular choice for customers with a One Day pass during the first quarter of the year, the number of bike rentals on the One Day pass decreased to the point that there is no significant difference in preference between Standard bikes and Smart bikes towards the end of 2019. |
Insight 6 |
Customers with a Monthly pass preferred Standard bikes during the first quarter of the year; however, Electric bikes gained more popularity over the rest of 2019. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Even though Smart bikes were introduced along with the Electric bikes, they failed to gain as much popularity as the Electric bikes. Hence discounts should be announced to increase the rental activity of Smart bikes during peak hours, which in turn helps the stations maintain the availability of other bike types. |
Reform 2 |
Use Smart bikes in promotional events like Bike rallies to familiarize customers with its features and encourage the customers to prefer Smart bikes in the future. |
Insights: |
|
|---|---|
Insight 1 |
The Monthly pass has always been the most popular choice among customers, and the discontinuation of the Walk-up pass in 2019 further increased the number of bike rentals on the Monthly pass. |
Insight 2 |
There is a slight increase in the rentals subjected to Annual pass in the year 2019. |
Insight 3 |
Majority of the bike rentals subjected to One Way trips are taken by the customers with Monthly subscription. |
Insight 4 |
The number of bike rentals taken on the One Day subscription for One Way trips decreased steadily over 2018 and 2019, which might be the reason for the increase in Monthly subscribers in the second half of 2019. |
Insight 5 |
Majority of the bike rentals subjected to Round Trips are taken by the customers with One Day subscription. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Discounts should be announced on One Day customer pass subscription to encourage the tourists and non-subscribers to rent a bike. |
Insights: |
|
|---|---|
Insight 1 |
The majority of customers use the Base fare option to reach their destinations. However, 2019 saw relatively fewer bike rentals in the first and second quarters than in the third and fourth quarters. Reforms must be taken to increase bike rentals in the first half of the year. |
Insight 2 |
Around 25.5% of the bike rentals generated extra income in the form of Extended fares, which portrays a good business model. However, the average number of bike rentals with Extended fares in 2019 is lower than in 2018 and needs to be increased by adopting new rental techniques that encourage customers to ride the bikes for longer durations. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Discounts/promotions should be announced to encourage the customers to ride bikes for longer durations. |
Insights: |
|
|---|---|
Insight 1 |
Based on the classification of aggregated bike rentals over various parameters, it can be concluded that most customers prefer Standard bikes over Smart bikes, take more One Way trips than Round Trips, and prefer the Monthly pass over other subscriptions. |
Insights: |
|
|---|---|
Insight 1 |
The bike rentals aggregated by hour of the day show that rentals start to increase slowly from 6:00 AM until 5:00 PM, with peaks at 8:00 AM, 12:00 PM, and 5:00 PM, which correspond to morning office hours, lunch time, and evening office closing time respectively. This suggests that a large portion of the customer base consists of working individuals who use the bikes for transportation. |
Insight 2 |
The average bike rentals over the hour of the day depicts that the rentals are least during night and early hours. |
Insight 3 |
The bike rentals decrease during the non-working days such as Saturday and Sunday. This reinforces the argument that the majority of the customer base consists of working individuals. |
Insight 4 |
The average number of bike rentals taken by non-subscribers and tourists (Walk-up pass or One Day pass) is less than the number of bike rentals taken by working individuals (with Monthly pass). This reinforces the argument that the majority of the customer base is composed of working individuals. |
Reforms proposed: |
|
|---|---|
Reform 1 |
The year 2019 experiences a steep decrease in bike rentals on non-working days compared to previous years. This reflects a failure to attract tourists and non-subscribers to ride a bike over weekends. Promotions should be announced for tourists and non-subscribers to encourage them to rent a bike. |
Reform 2 |
Encouraging working individuals to ride a bike during non-working days in a week will increase the revenue generation. |
Insights: |
|
|---|---|
Insight 1 |
The rental activity is highest in the Afternoon, with Morning and Evening close behind. This shows that customers use bike rentals the most during daytime. Conversely, the rental activity is lowest during Early Hours and at Night. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Promoting morning fitness activities such as Morning bike challenges will potentially increase the bike rental activity during Early Hours of the day. |
Reform 2 |
Similarly, tie-ups with night events will boost Night-time bike rentals. |
Insights: |
|
|---|---|
Insight 1 |
The bike rentals aggregated over the day of the month show that rentals decrease slightly towards the end of the month. However, a deeper analysis using the average bike rentals makes it clear that the rental activity actually increases towards the end of the month. |
Insight 2 |
Also, the distribution of average bike rentals over the day of the month ranges only between 700 and 800. This shows that there is no significant difference in average bike rentals between any two days of a month. |
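The contrast between raw totals and averages by day of the month can be illustrated with a small sketch. One likely contributor to the dip in totals is that days 29 to 31 do not occur in every month, so their totals are built from fewer calendar dates. The data below is synthetic and only the `start_time` column name is taken from the dataset description; this is not the notebook's original code.

```python
import pandas as pd

# Synthetic data: exactly two rentals on every day of Jan-Apr 2019.
days = pd.date_range("2019-01-01", "2019-04-30", freq="D")
trips = pd.DataFrame({"start_time": days.repeat(2)})

# Rentals per calendar date, then grouped by day of the month.
daily = trips.groupby(trips["start_time"].dt.normalize()).size()
by_day = daily.groupby(daily.index.day)

total_per_day = by_day.sum()   # raw totals: days 29-31 are built from fewer dates
mean_per_day = by_day.mean()   # average per occurrence of that day of the month

print(total_per_day.loc[[28, 31]])
print(mean_per_day.loc[[28, 31]])
```

With identical daily demand, the total for day 31 is half the total for day 28 while the averages are equal, which is why the averaged view is the fairer comparison.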
Insights: |
|
|---|---|
Insight 1 |
In 2017 and 2018, average bike rentals on weekends are only slightly lower than on other days of the week; in 2019, however, average bike rentals drop suddenly on weekends (Saturday and Sunday). This is not a good sign for a healthy business model and requires reforms. |
Insight 2 |
Customers with long-term subscriptions such as the Annual pass and Monthly pass prefer Standard bikes and Electric bikes for travel on working days and are less likely to travel during weekends. As the customer base consists mostly of working individuals, they tend to prefer One Way trips, which decrease during weekends. |
Insight 3 |
The above plot shows that although the number of customers taking One Way trips (probably working individuals who ride to work) decreases over weekends, the number of customers taking Round Trips increases during the weekends. |
Insight 4 |
The pass_type holds a stronger influence on the bike rentals over the week rather than bike_type. |
Insight 5 |
The smart bike experiences a slight increase in average bike rentals over weekends. |
Insight 6 |
New/temporary customers with no existing pass (say tourists/travellers/activists) tend to take a short-term pass such as the One Day pass and prefer Standard bikes and Smart bikes. Hence Smart bikes experience their highest bike rentals during the weekends. This category of customers also tends to take Round Trips and ride for longer durations, resulting in Extended fares and thus generating more income for the company. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Organizing or promoting fitness and recreational activities such as bike rallies could significantly increase bike rentals on weekends and holidays. |
Reform 2 |
The number of One Day pass customers who prefer Standard bikes dropped significantly during 2019. Hence attracting this category of customers to use Standard bikes will enhance the business model significantly. |
Reform 3 |
As a major part of the customer base is composed of working individuals, take advantage of the low rentals during the weekend to rebalance the availability of bikes across stations in support of the bike rental traffic on Monday. |
Insights: |
|
|---|---|
Insight 1 |
The stations with IDs (4143, 4327, 4362, 4363, 4373, 4490, 4321, 4467, 4468) have very low bike activity (rentals and returns combined) and are deemed high maintenance. These stations do not even reach 10 bike activities over the span of 3 years. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Hence these stations are not financially suitable for further business and need to be either terminated or relocated to locations with potential bike traffic. |
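A minimal sketch of how such low-activity stations could be flagged, assuming activity is defined as rentals plus returns per station. The rows and the toy threshold are illustrative only; `start_station_id` and `end_station_id` are the column names from the `station` table.

```python
import pandas as pd

# Illustrative trips only; not the real dataset.
trips = pd.DataFrame({
    "start_station_id": [3005, 3005, 3006, 3005, 4143],
    "end_station_id":   [3006, 3005, 3005, 3006, 3005],
})

# Activity = rentals (trip starts) + returns (trip ends) per station.
rentals = trips["start_station_id"].value_counts()
returns = trips["end_station_id"].value_counts()
activity = rentals.add(returns, fill_value=0).astype(int)

THRESHOLD = 2  # assumed cutoff for this toy data; the insight above uses 10
low_activity = activity[activity < THRESHOLD].index.tolist()
print(low_activity)
```

Using `fill_value=0` keeps stations that appear only as a start or only as an end station, which matters because those are exactly the stations most likely to fall under the cutoff.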
Insights: |
|
|---|---|
Insight 1 |
A 6-hour window (8:00 AM - 2:00 PM) experiences a shortage in the supply of bikes compared to customer demand. However, the gap between supply and demand is very small and does not require any immediate attention. |
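One way to look at the hourly gap is to treat rentals per hour as demand and returns per hour as supply. The sketch below makes that assumption on three made-up trips; `start_time` and `end_time` are the column names from the `time` table, and everything else is hypothetical.

```python
import pandas as pd

# Illustrative trips only.
trips = pd.DataFrame({
    "start_time": pd.to_datetime(
        ["2019-05-01 08:10", "2019-05-01 08:40", "2019-05-01 17:05"]),
    "end_time": pd.to_datetime(
        ["2019-05-01 08:35", "2019-05-01 09:05", "2019-05-01 17:30"]),
})

hours = range(24)
# Demand: bikes taken out per hour. Supply: bikes brought back per hour.
demand = trips["start_time"].dt.hour.value_counts().reindex(hours, fill_value=0)
supply = trips["end_time"].dt.hour.value_counts().reindex(hours, fill_value=0)

gap = supply - demand  # negative values mean more rentals than returns
print(gap[gap != 0])
```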
Insights: |
|
|---|---|
Insight 1 |
The bike rental traffic is not equally distributed over the start stations. However, this does not imply that the low-traffic start stations should be eliminated, as they might receive good bike return traffic and still deliver acceptable business metrics. |
Insight 2 |
A major number of stations subjected to Round Trips experience Low and Very Low rental traffic. This reveals the need of improving rental traffic at the start stations subjected to Round Trips. |
Insight 3 |
The number of start stations that experience High and Very high bike rental traffic for Round Trips is less than that of One Way trips. This denotes that One Way trips are more popular among the customers. |
Insight 4 |
Only a very small number of start stations, across bike types, experience Very Low bike rental activity, which is a good sign for a healthy business model. However, the number of start stations that experience Very High rental activity is also very small. This keeps the start stations from serving their full potential. |
Insight 5 |
As the start stations that experience Low and Very Low bike rental activity are clustered closely across bike types, it appears that the bike type has no influence on the rental activity at these stations and that other factors may have caused the decrease in rental activity at these specific stations. So no action related to bike type is required to increase the bike rental activity at start stations with Low and Very Low rental activity. |
Insight 6 |
Smart bikes have fewer start stations with Normal and High rental traffic compared to other bike types. This suggests that Smart bikes require more advertisement and awareness among customers. |
Insight 7 |
A major number of start stations subjected to Annual pass experience either Low or Very Low rental traffic. As the Annual pass is a long-term subscription, this behaviour is to be expected. |
Insight 8 |
All the start stations subjected to Flex pass fall into Low and Very Low bike rental traffic. This is because the Flex pass was originally issued to employees for testing purposes. Hence this insight is ignored. |
Insight 9 |
It appears that a fair number of start stations experience Low rental traffic for the Monthly pass type. As the Monthly pass is mostly preferred by working individuals, no action is likely to make them switch start stations, since that might delay their ride to work. Hence no action is required to increase rental activity for the Monthly pass type at start stations with Low rental activity. |
Insight 10 |
There exists many start stations with relatively Low bike rental activity, subjected to One Day pass. This might be due to the influence of its geographical location or acquisition of bike rentals related to other customer pass types. |
Insight 11 |
A major number of start stations subjected to Extended fare types experience Low rental traffic. Action should be taken to encourage customers to take longer trips at these stations to increase income generation. |
Insight 12 |
A major number of start stations subjected to Base fare experience Normal and Higher rental traffic. This is a good sign for healthy business model with higher average in bike rental traffic over time. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Discounts or promotional activities should be announced for Round Trips to encourage customers to rent bikes from the stations with Low and Very Low bike rental activity subjected to Round Trips. |
Reform 2 |
Announce promotions on Smart bikes to increase their rental activity at start stations. |
Reform 3 |
Promotions should be announced to increase the rental traffic subjected to One Day passes at start stations with Low bike rental activity. |
Reform 4 |
Promotions should be announced to encourage customers to take longer trips at the stations with Low bike rental traffic subjected to Extended Fares, to increase income generation. |
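The traffic levels used above (Very Low to Very High) imply that per-station rental counts were binned into five categories. The cut points are not stated in this summary, so the sketch below simply assumes five equal-width bins via `pandas.cut`; the counts and station IDs are made up.

```python
import pandas as pd

# Illustrative rental counts per start station (made-up values).
rentals = pd.Series(
    {3005: 12, 3006: 480, 3007: 950, 3008: 1400, 3009: 2000},
    name="rentals",
)

labels = ["Very Low", "Low", "Normal", "High", "Very High"]
# Assumption: five equal-width bins; the original thresholds are not given.
traffic = pd.cut(rentals, bins=5, labels=labels)

# Number of stations in each traffic level, in category order.
print(traffic.value_counts(sort=False))
```

If the counts are heavily skewed, quantile-based bins (`pandas.qcut`) would spread stations more evenly across the levels; which variant matches the original analysis is not known from this section.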
Insights: |
|
|---|---|
Insight 1 |
The bike return traffic is not equally distributed over the end stations. There exist end stations with Very Low bike return activity. However, this does not imply that these end stations should be eliminated, as they might receive good bike rental traffic and still deliver acceptable business metrics. |
Insight 2 |
Many end stations subjected to One Way trips experience Low bike return traffic. For Round Trips, the bike rental and the bike return happen at one and the same station, so Low bike return traffic at a station also implies Low bike rental traffic there. Unlike Round Trips, however, Low bike returns for One Way trips create a deficit in the availability of bikes at that station. Hence immediate action is required to redirect bikes from stations with High and Very High bike return traffic to stations with Low and Very Low bike return traffic for One Way trips, to normalize the availability of bikes over all stations. |
Insight 3 |
A major number of end stations subjected to Round Trips experience Low bike return traffic. |
Insight 4 |
A number of end stations that experience Low and Very Low bike return activity are clustered closely together. This indicates that the bike type has no influence on the return activity at these stations, because bike returns are correlated with bike rentals. So no action related to bike type is required to increase the bike return activity at the end stations with Low and Very Low bike return activity. |
Insight 5 |
Bike returns subjected to Annual pass type have a high number of end stations with Low and Very Low return traffic. This is because the Annual pass has a high number of stations with Low and Very Low rental activity. |
Insight 6 |
There exist many end stations with relatively Low bike rental activity, subjected to One Day pass. |
Insight 7 |
The bike returns subjected to Extended Fares incur a high number of end stations with Low return traffic and less number of stations with High return traffic. This denotes that Extended Fares are less desired by the customers. |
Insight 8 |
The bike returns subjected to Base Fares incur a high number of end stations with High return traffic and less number of stations with Low return traffic. This denotes that Base Fares are more preferred by the customers. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Promotions should be announced to encourage customers to opt for the Round Trips at the end stations with Low bike return traffic. |
Reform 2 |
Smart bikes have fewer end stations with Normal and High bike return traffic compared to other bike types. This is because Smart bikes have fewer start stations with Normal and High bike rental traffic compared to other bike types. However, some stations might receive fewer Smart bike returns than Smart bike rentals, which will result in a Smart bike deficiency at these particular stations. Hence the Smart bike returns should be adjusted/redirected to match the bike demand at the specific end stations. |
Reform 3 |
As there exists many end stations with relatively Low bike rental activity subjected to One Day pass, reforms should be taken to redirect the bike returns from other stations to match the demand for bikes at these end stations. |
Reform 4 |
Actions should be taken to encourage the customers to ride the bikes for longer durations to incur Extended Fares thus generating more income to the company. |
Insights: |
|
|---|---|
Insight 1 |
It appears that customers most frequently return the bikes just after the Base Fare margin. Even if this is a result of individual preference or behaviour, it might give rise to customer dissatisfaction at paying an extra fee (Extended Fare) that could have been avoided with a little caution. This indicates a need for a reminder that notifies customers of the expiration of the Base Fare. |
Insight 2 |
It appears that 16% of the bike rides would be eligible for a 5-minute grace period before the Extended Fare is charged. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Launching a reminder, by mobile notification or other means, to notify customers of the expiration of the Base Fare will alert them to return the bike to the nearest station and avoid the Extended Fare, which will result in increased customer satisfaction. |
Reform 2 |
The alternate option is to give a 5-minute grace period before Extended Fares apply. Since the percentage of rides eligible for the 5-minute grace period is far lower than the percentage of rides with Extended Fares, the income generated from Extended Fares will only decrease slightly. Hence deployment of a grace period is a fair option with little impact on the income generated from Extended Fares. |
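The share of rides that would fall inside a grace period can be estimated directly from the `duration` column. The sketch below assumes durations are in minutes and that the Base Fare covers rides up to 30 minutes, as the duration insights suggest; the sample values are invented.

```python
import pandas as pd

# Illustrative trip durations in minutes (made-up values).
trips = pd.DataFrame({"duration": [6, 12, 28, 31, 33, 45, 90, 5, 29, 34]})

BASE_LIMIT = 30  # assumed Base Fare window in minutes
GRACE = 5        # proposed grace period in minutes

extended = trips["duration"] > BASE_LIMIT
in_grace = extended & (trips["duration"] <= BASE_LIMIT + GRACE)

# Share of all rides, and share of Extended Fare rides, inside the grace window.
share_of_all = round(in_grace.mean() * 100, 1)
share_of_extended = round(in_grace.sum() / extended.sum() * 100, 1)
print(share_of_all, share_of_extended)
```

Reporting both shares is useful here: the first measures how many customers would benefit, the second how much of the Extended Fare income would be given up.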
Insights: |
|
|---|---|
Insight 1 |
It appears that most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations. |
Insight 2 |
The average trip duration of the Overall dataset is 30 minutes. However when the outliers are removed and re-evaluated, the average trip duration is down to 18 minutes. |
Insight 3 |
Also for trips under 30 minutes, the average trip duration is 12 minutes. |
Insight 4 |
The most frequent trip duration remains 6 minutes for all datasets limited by trip duration. This denotes that most customers rent the bikes for a quick, short trip. |
Insight 5 |
The customers tend to travel longer when it comes to One Way trips compared to Round Trips. |
Insight 6 |
The most frequent trip duration for One Way trips remained the same, 28 minutes, across the various datasets. This means that most of the customers who prefer longer trips use the full extent of the Base Fare trip duration when it comes to One Way trips. |
Insight 7 |
The most frequent trip duration for Round Trips remained the same, 5 minutes, across the various datasets. This means that most customers prefer short trips when it comes to Round Trips. |
Insight 8 |
The customers prefer Smart bikes over other bike types when it comes to trips with longer durations. |
Insight 9 |
The difference between Smart bikes and other bikes increases with the dataset limit. For example, the average trip duration of Smart bikes is higher in the dataset limited to 120 minutes than in the one limited to 30 minutes. |
Insight 10 |
Standard and Electric bikes have close values of average trip duration. This indicates that customers have almost the same preference for these bike types when it comes to trip durations. |
Insight 11 |
The bike rides on short-term subscriptions such as the One Day and Walk-up passes have higher average trip durations compared to long-term subscriptions such as the Monthly and Annual passes, and the difference increases with the trip duration limit. |
Insight 12 |
The bike trips subjected to Extended fares have a heavy presence of outliers. When outliers are removed and the dataset is limited to 120 minutes, the average trip duration subjected to Extended Fares is 57 minutes. |
Insight 13 |
The most frequent trip duration of Extended Fare is 31 minutes. This denotes that many customers who intended to return the bikes for Base Fare failed in returning the bike under 30 minutes by a margin of 1 minute. |
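The pattern described above, where the mean drops sharply as long trips are excluded while the most frequent duration stays put, can be reproduced with a small sketch. The durations are invented, and the 120-minute and 30-minute caps are assumptions based on the limits mentioned in the insights; how outliers were removed in the original analysis is not stated here.

```python
import pandas as pd

# Illustrative durations in minutes, including one extreme outlier.
duration = pd.Series([6, 6, 6, 10, 14, 28, 31, 60, 110, 1440], name="duration")

# Assumed caps: full data, trips up to 120 minutes, trips up to 30 minutes.
subsets = {
    "all trips": duration,
    "up to 120 min": duration[duration <= 120],
    "up to 30 min": duration[duration <= 30],
}

summary = pd.DataFrame({
    "mean": {name: s.mean() for name, s in subsets.items()},
    "mode": {name: s.mode().iloc[0] for name, s in subsets.items()},
})
print(summary.round(1))
```

The mean is pulled up by a handful of very long trips, whereas the mode only depends on the most common value, so it is unaffected by where the cap is placed.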
Reforms proposed: |
|
|---|---|
Reform 1 |
Most customers prefer rides with either Normal or Small trip durations and avoid trips with Long and Very Long durations. Even if this behaviour is to be expected, encouraging customers to ride the bikes for longer durations will generate income and improve business metrics. |
Reform 2 |
Organizing 2hr bike rallies and other events will attract enthusiasts to ride the bike for longer durations. |
Reform 3 |
Announcing Low Fares for tourists will attract them to rent the bike for longer durations. |
Insights: |
|
|---|---|
Insight 1 |
There is a fair amount of bike rental distribution when it comes to trip distances of Small, Normal and Large. This represents a productive business model covering all kinds of customer needs. |
Insight 2 |
Only a small number of customers take Very Small and Very Long bike rides. This is also a good sign, as Very Small rides represent customer dissatisfaction, and a larger number of Very Long rides would disperse the available bikes across a geographical location. |
Insight 3 |
A good amount of customers preferred Round Trips. |
Insight 4 |
One Way trips have an average trip distance of 0.8 miles, while the most frequent trip distance is 0.5 miles. |
Insight 5 |
Customers preferred Smart bikes for longer trip distances compared to other bike types. |
Insight 6 |
The bike rides subjected to Monthly pass has lower average bike rentals compared to other pass types. |
Insight 7 |
The short-term passes, such as One Day and Walk-up, have a higher most frequent trip duration than long-term subscriptions. This is because long-term passes are chosen by working individuals for the ride to work, while short-term passes are preferred by tourists and explorers who take longer rides to explore the city. |
Insight 8 |
The trips with Extended Fares have a higher most frequent trip duration than trips with the Base Fare. This is to be expected, as trip distances and trip durations are correlated. |
Reforms proposed: |
|
|---|---|
Reform 1 |
Organizing long bike rallies and other events will attract enthusiasts to ride the bike for longer distances. |
Credits:¶================
My sincere and deep gratitude to the Udacity platform for making this Data Analyst Nanodegree available.
Sebastian Thrun |
: INSTRUCTOR |
|---|---|
About |
As the founder and president of Udacity, Sebastian’s mission is to democratize education. He is also the founder of Google X, where he led projects including the Self-Driving Car, Google Glass, and more. |
Derek Steer |
: CEO AT MODE |
|---|---|
About |
Derek is the CEO of Mode Analytics. He developed an analytical foundation at Facebook and Yammer and is passionate about sharing it with future analysts. He authored SQL School and is a mentor at Insight Data Science. |
Mike Yi |
: INSTRUCTOR |
|---|---|
About |
Mike is a content developer with a multidisciplinary academic background, including math, statistics, physics, and psychology. Previously, he worked on Udacity's Data Analyst Nanodegree program as a support lead. |
Josh Bernhard |
: DATA SCIENTIST |
|---|---|
About |
Josh has been sharing his passion for data for nearly a decade at all levels of university, and as Lead Data Science Instructor at Galvanize. He's used data science for work ranging from cancer research to process automation. |
David Venturi |
: INSTRUCTOR |
|---|---|
About |
Formerly a chemical engineer and data analyst, David created a personalized data science master's program using online resources. He has studied hundreds of online courses and is excited to bring the best to Udacity students. |
Sam Nelson |
: PRODUCT LEAD |
|---|---|
About |
Sam Nelson is the Product Lead for Udacity’s Data Analyst, Business Analyst, and Data Foundations programs. He’s worked as an analytics consultant on projects in several industries, and is passionate about helping others improve their data skills. |
Juno Lee |
: CURRICULUM LEAD |
|---|---|
About |
Juno is the curriculum lead for the School of Data Science. She has been sharing her passion for data and teaching, building several courses at Udacity. As a data scientist, she built recommendation engines, computer vision and NLP models, and tools to analyze user behavior. |
Mat Leonard |
: CURRICULUM LEAD |
|---|---|
About |
Mat, the curriculum lead, is a former physicist, neuroscientist, and data scientist with a passion for education. Recently, he led the Deep Learning Nanodegree Foundation program covering state-of-the-art machine learning models. |
Vamshi Krishna Prime: Data Analyst
- Email : vamshi.krishna.prime@gmail.com
- website: https://www.vamshi-krishna.com
- GitHub : https://github.com/vamshi-krishna-prime
- LinkedIn : https://www.linkedin.com/in/vamshi-krishna-prime